hjortur17's avatar

Remove HTML from JSON object

I can't figure out how to remove HTML tags from a JSON object. I'm getting data from an API, but the response contains some HTML tags. I have tried json_decode and so on, but that did not work.

So what is the best way to remove html tags in PHP?

Here is an example of the response from the API:

    12 => {#251 ▼
      +"date": "07:40"
      +"flightNumber": "<span class="cutoff" title="Brussels">Brussels</span>"
      +"airline": "FI554"
      +"to": "<span class="cutoff" title="Icelandair">Icelandair</span>"
      +"plannedArrival": "<span class="cutoff" title="Departed 07:50">Departed 07:50</span>"
      +"realArrival": "\r\n"
      +"status": null
    }
    13 => {#252 ▼
      +"date": "07:40"
      +"flightNumber": "<span class="cutoff" title="Berlin Tegel">Berlin Tegel</span>"
      +"airline": "FI528"
      +"to": "<span class="cutoff" title="Icelandair">Icelandair</span>"
      +"plannedArrival": "<span class="cutoff" title="Departed 07:52">Departed 07:52</span>"
      +"realArrival": "\r\n"
      +"status": null
    }

and here is how I'm getting this:

$content = json_decode(file_get_contents("https://apis.is/flight?language=en&type=departures"));

dd($content);
tykus's avatar

You would need to iterate over the keys and apply strip_tags() to the value:

$content = json_decode(file_get_contents("https://apis.is/flight?language=en&type=departures"), true);

$content = collect($content)->map(function ($flight) {
    return collect($flight)->mapWithKeys(function ($value, $key) {
        return [$key => strip_tags($value)];
    })->all();
})->all();

dd($content);

It might not hurt to remove those \r\n newline characters either (using preg_replace it might look as follows):

return [
    $key => trim(preg_replace('/\s+/', ' ', strip_tags($value)))
];
hjortur17's avatar

If I try using your method I get this error: strip_tags() expects parameter 1 to be string, array given

tykus's avatar

What level of the response data are you at? The nested collections should have handled this, but it depends on the shape of the original response.

hjortur17's avatar

If I only do this: $content = file_get_contents("https://apis.is/flight?language=en&type=departures");

This is the response:

"results":[{"date":"\r\n\t\t\t\t\t\t\t\t\t<a href=\"#\" class=\"btn btn-rounded btn-orange show-departed\"><span>Show</span> departed flights <i class=\"icon-right-open-big\" aria-hidden=\"true\"></i></a>\r\n\t\t\t\t\t\t\t\t\t\t","flightNumber":null,"airline":null,"to":null,"plannedArrival":null,"realArrival":null,"status":null},{"date":"\r\n\t\t\t\t\t\t\t\t\t\t\tNo flights found\r\n\t\t\t\t\t\t\t\t\t\t","flightNumber":null,"airline":null,"to":null,"plannedArrival":null,"realArrival":null,"status":null},{"date":"01:35","flightNumber":"<span class=\"cutoff\" title=\"Madrid\">Madrid</span>","airline":"I23661","to":"<span class=\"cutoff\" title=\"Iberia Express\">Iberia Express</span>","plannedArrival":"<span class=\"cutoff\" title=\"Departed 01:28\">Departed 01:28</span>","realArrival":"\r\n","status":null},{"date":"07:20","flightNumber":"<span class=\"cutoff\" title=\"Munich\">Munich</span>","airline":"FI532","to":"<span class=\"cutoff\" title=\"Icelandair\">Icelandair</span>","plannedArrival":"<span class=\"cutoff\" title=\"Departed 07:23\">Departed 07:23</span>","realArrival":"\r\n","status":null},{"date":"07:20","flightNumber":"<span class=\"cutoff\" title=\"Geneva\">Geneva</span>","airline":"FI564","to":"<span class=\"cutoff\" title=\"Icelandair\">Icelandair</span>","plannedArrival":"<span class=\"cutoff\" title=\"Departed 07:23\">Departed 07:23</span>","realArrival":"\r\n","status":null},{"date":"07:20","flightNumber":"<span class=\"cutoff\" title=\"Zurich\">Zurich</span>","airline":"FI568","to":"<span class=\"cutoff\" title=\"Icelandair\">Icelandair</span>","plannedArrival":"<span class=\"cutoff\" title=\"Departed 07:23\">Departed 07:23</span>","realArrival":"\r\n","status":null


and there is more...
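
The "array given" error is consistent with the raw response above: the flights appear to sit under a top-level "results" key, so mapping over the decoded payload directly hands a whole flight row to strip_tags(). Here is a minimal sketch under that assumption, in plain PHP rather than collections so it runs on its own. The helper names cleanValue and cleanFlights are made up for illustration, and the sample JSON is trimmed from the response above.

```php
<?php

// Hypothetical helper: strip tags and collapse whitespace; non-strings pass through.
function cleanValue($value)
{
    if (!is_string($value)) {
        return $value; // e.g. null stays null instead of being fed to strip_tags()
    }

    return trim(preg_replace('/\s+/', ' ', strip_tags($value)));
}

// Hypothetical helper: clean every field of every flight row.
function cleanFlights(string $json): array
{
    $payload = json_decode($json, true);

    // Assumption: the flights live under a top-level "results" key.
    $flights = $payload['results'] ?? [];

    return array_map(function (array $flight) {
        // array_map() with a single array keeps the string keys.
        return array_map('cleanValue', $flight);
    }, $flights);
}

// One row copied from the raw response above (nowdoc, so backslashes stay literal).
$json = <<<'JSON'
{"results":[{"date":"07:20","flightNumber":"<span class=\"cutoff\" title=\"Munich\">Munich</span>","airline":"FI532","to":"<span class=\"cutoff\" title=\"Icelandair\">Icelandair</span>","plannedArrival":"<span class=\"cutoff\" title=\"Departed 07:23\">Departed 07:23</span>","realArrival":"\r\n","status":null}]}
JSON;

$flights = cleanFlights($json);
```

Passing non-strings through untouched also avoids handing null to strip_tags(), which newer PHP versions complain about.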
Snapey's avatar

Sorry, but that is a really crap API

Obviously scraped from some different places, with large inconsistencies in the responses.

I wouldn't base anything other than an experiment on this.

Snapey's avatar

strip_tags is not going to really help when you have a field called 'date' and it contains

"\r\n\t\t\t\t\t\t\t\t\t<a href=\"#\" class=\"btn btn-rounded btn-orange show-departed\"><span>Show</span> departed flights <i class=\"icon-right-open-big\" aria-hidden=\"true\"></i></a>\r\n\t\t\t\t\t\t\t\t\t\t

Pass that through strip_tags() and you end up with this for 'date':

    """
   \r\n
   \t\t\t\t\t\t\t\t\tShow departed flights \r\n
   \t\t\t\t\t\t\t\t\t\t
   """
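
To make the point concrete: collapsing the whitespace tidies the string, but what is left is still a button label, not a time. One possible workaround is to validate the field and drop rows that fail. This is only a sketch; isFlightRow is a hypothetical helper, and the HH:MM pattern is an assumption based on the sample rows in this thread.

```php
<?php

// The "date" value quoted above, written as a PHP double-quoted string.
$raw = "\r\n\t\t\t\t\t\t\t\t\t<a href=\"#\" class=\"btn btn-rounded btn-orange show-departed\"><span>Show</span> departed flights <i class=\"icon-right-open-big\" aria-hidden=\"true\"></i></a>\r\n\t\t\t\t\t\t\t\t\t\t";

// Strip the tags, then collapse runs of whitespace into single spaces.
$clean = trim(preg_replace('/\s+/', ' ', strip_tags($raw)));

// Hypothetical check: assume a real flight row has a "date" in HH:MM form.
function isFlightRow(array $row): bool
{
    return is_string($row['date'] ?? null)
        && preg_match('/^\d{2}:\d{2}$/', $row['date']) === 1;
}

$rows = [
    ['date' => $clean, 'airline' => null],
    ['date' => 'No flights found', 'airline' => null],
    ['date' => '07:20', 'airline' => 'FI532'],
];

// Keep only rows that look like flights; array_values() reindexes from 0.
$flights = array_values(array_filter($rows, 'isFlightRow'));
```

Only the 07:20 row survives the filter. Whether silently dropping the other rows is acceptable depends on what the application needs.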
jlrdw's avatar

@snapey geeeeze, I think he's not properly using the api then, you are right.

OP, better roll up your sleeves and start writing some custom import code, good luck.

tykus's avatar

"results":[{"date":"\r\n\t\t\t\t\t\t\t\t\t<a href=\"#\" class=\"btn btn-rounded btn-orange show-departed\"><span>Show</span> departed flights <i class=\"icon-right-open-big\" aria-hidden=\"true\"></i></a>\r\n\t\t\t\t\t\t\t\t\t\t",

What the actual fuck!?!?

Cronix's avatar

I particularly like when there is no date in a result

date    "\r\n\t\t\t\t\t\t\t\t\t\t\tNo flights found\r\n\t\t\t\t\t\t\t\t\t\t"

I'm glad I'm not working with this data. There's just so much wrong with it, and it's horribly inconsistent. You'd probably spend days sorting this out and working through all of the edge cases... and then they'll go change something in the output, messing up all of your workarounds.

Use a solid, well defined, hopefully versioned api. This one is not.

jlrdw's avatar

@cronix maybe it's infinity. Or "Twilight Zone".

@hjortur17 I think you have somehow got the wrong data, that is not API data, I promise.

jlrdw's avatar

I'm usually leery about clicking any links in the Forum, except known good links.

I might check it out with virustotal first then have a look.

I looked, it appears to be API data for display. Or to display.

Definitely not raw data. There's probably another part somewhere for raw data.

Cronix's avatar

Then why would you make promises and comment on things you don't actually know because you haven't taken the time to verify them?

"promise" kind of makes it sound like you really actually know what you're talking about.

jlrdw's avatar

I'll rephrase: okay, it looks like some sort of internal usage data, not meant to be an application programming interface for just any outside use.

Like: https://www.adoptapet.com/public/apis/pet_list.html

You are set up with them and use the data per their spec. So an API is two-way, not just one.

One-way usage doesn't fulfill the two-way usage, thus it's not an API. Just the data alone doesn't define an API.

You also have to consider the interface part.

When OP retrieves the data with the proper coding to receive it as intended then it will be an API

Snapey's avatar

ok here we go, grab some popcorn. bllx being spouted again

jlrdw's avatar

Tonight it's chocolate chip cookies from MD.

Enough said, I think OP needs to work on this, whatever it truly is.

hjortur17's avatar

Okay, so I got some of the html out. Now I can't figure out how I can search in the response.

Here is the response:

    8 => array:7 [▼
      "date" => "07:20"
      "flightNumber" => "Geneva"
      "airline" => "FI564"
      "to" => "Icelandair"
      "plannedArrival" => "Departed 07:22"
      "realArrival" => ""
      "status" => null
    ]
    9 => array:7 [▼
      "date" => "07:20"
      "flightNumber" => "Zurich"
      "airline" => "FI568"
      "to" => "Icelandair"
      "plannedArrival" => "Departed 07:21"
      "realArrival" => ""
      "status" => null
    ]
    10 => array:7 [▼
      "date" => "07:25"
      "flightNumber" => "Frankfurt"
      "airline" => "FI520"
      "to" => "Icelandair"
      "plannedArrival" => "Departed 07:19"
      "realArrival" => ""
      "status" => null
    ]

and I have a request for FI520, how can I search in this JSON data?
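
One way to look up a flight in the cleaned array is a simple loop over the rows. Note that in the dump above the code FI520 sits under the "airline" key (the scraped labels look shifted), so this sketch searches that key by default. findFlight is a hypothetical helper name, and the rows are copied from the response above.

```php
<?php

// Hypothetical helper: return the first row whose $key equals $code, or null.
function findFlight(array $flights, string $code, string $key = 'airline'): ?array
{
    foreach ($flights as $flight) {
        if (($flight[$key] ?? null) === $code) {
            return $flight;
        }
    }

    return null; // nothing matched
}

// Rows copied from the cleaned response above.
$flights = [
    8 => ['date' => '07:20', 'flightNumber' => 'Geneva', 'airline' => 'FI564', 'to' => 'Icelandair', 'plannedArrival' => 'Departed 07:22', 'realArrival' => '', 'status' => null],
    9 => ['date' => '07:20', 'flightNumber' => 'Zurich', 'airline' => 'FI568', 'to' => 'Icelandair', 'plannedArrival' => 'Departed 07:21', 'realArrival' => '', 'status' => null],
    10 => ['date' => '07:25', 'flightNumber' => 'Frankfurt', 'airline' => 'FI520', 'to' => 'Icelandair', 'plannedArrival' => 'Departed 07:19', 'realArrival' => '', 'status' => null],
];

$match = findFlight($flights, 'FI520');
```

If the data is already in a Laravel collection, collect($flights)->firstWhere('airline', 'FI520') should return the same row.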

hollyit's avatar

That site is not a true API. They are a scraping service that just parses out HTML code and tries to come up with an API. There are enormous problems with basing anything on that service. The two biggies:

  1. You could be violating the source site's terms of service, hence opening yourself to legal jeopardy.

  2. If the original site does a redesign, or even just changes a bit of HTML, well, your API is now broken.

That apis.is site is actually open source and you can view their code, including the endpoint for the flight data, here:

https://github.com/apis-is/apis/blob/master/endpoints/flight/index.js

Just because they do some DOM walking and throw it into a JSON format doesn't make it an API, the same way that if I take some wheels off a grocery cart and put them on a wooden box, it doesn't make a car. Your best bet would be to either look for a true API, or write your own DOM parser for that site (though issue 1 above will remain). You can use PHP's built-in DOM parsing classes, or a third-party package like Symfony's DomCrawler.
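
For the built-in route, a rough sketch with DOMDocument and DOMXPath might look like the following. The HTML snippet, the table id, the column order and the array keys are all invented for illustration; the real page's markup will differ and would need its own queries.

```php
<?php

// Hypothetical markup standing in for a scraped departures page.
$html = <<<'HTML'
<table id="departures">
  <tr><td>07:20</td><td>FI532</td><td><span class="cutoff" title="Munich">Munich</span></td></tr>
  <tr><td>07:25</td><td>FI520</td><td><span class="cutoff" title="Frankfurt">Frankfurt</span></td></tr>
</table>
HTML;

// Collect libxml warnings internally instead of emitting them on messy HTML.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$flights = [];

foreach ($xpath->query('//table[@id="departures"]//tr') as $row) {
    $cells = $xpath->query('./td', $row);
    if ($cells->length < 3) {
        continue; // skip header or message rows with fewer cells
    }

    // textContent returns the text with all nested tags already removed.
    $flights[] = [
        'time'        => trim($cells->item(0)->textContent),
        'flight'      => trim($cells->item(1)->textContent),
        'destination' => trim($cells->item(2)->textContent),
    ];
}
```

Reading textContent from each cell means no strip_tags() pass is needed afterwards, and the keys are chosen by you rather than by the scraper.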

hjortur17's avatar
Best Answer

I stopped using that API service and found another one. Did not need to strip anything from it.
