Get top 100 words / keywords from a text with PHP

This is useful when you want to generate dynamic keywords from content, or simply rank the words in a text or HTML document by how often they appear, while excluding very common words such as "the", "on", "and" and "to". You can set a custom limit on the number of words returned and supply your own list of words to ignore.

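<?php

// Return the most frequent words in $str, most frequent first.
// $limit  - maximum number of words to return
// $ignore - space-separated list of words to skip (matching is case-sensitive)
function top_words($str, $limit=100, $ignore=""){

	if(!$ignore) $ignore = "the of to and a in for is The that on said with be was by";

	$ignore_arr = explode(" ", $ignore);

	$str = trim($str);
	// replace HTML entities such as &amp; or &#169; with a space
	$str = preg_replace("#&\#?[a-z0-9]{2,7};#i", " ", $str);
	// replace punctuation with a space
	$str = preg_replace("#[()°^!\"§\$%&/{(\[)\]=}?´`,;.:\-_\#'~+*]#", " ", $str);
	// collapse runs of whitespace into a single space
	$str = preg_replace("#\s+#", " ", $str);
	$arraw = explode(" ", $str);

	$arr = array(); // word => number of occurrences
	foreach($arraw as $v){
		$v = trim($v);
		// skip very short words and ignored words
		if(strlen($v)<3 || in_array($v, $ignore_arr)) continue;
		if(!isset($arr[$v])) $arr[$v] = 0;
		$arr[$v]++;
	}

	// sort by count, highest first
	arsort($arr);

	// preserve keys so purely numeric words (e.g. "2024") are not renumbered
	return array_keys( array_slice($arr, 0, $limit, true) );
}

// usage:
// $meta_keywords = implode(", ", top_words( strip_tags( $html_content ) ) );
?>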
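
If you also need the counts themselves, or want "The" and "the" treated as the same word, the same idea can be sketched with PHP's built-in array_count_values(). This is a separate, minimal sketch and not the function above: the name top_word_counts, its array-based ignore list and the lowercasing step are assumptions made for illustration, and it only handles plain ASCII letters and digits.

```php
<?php
// Hypothetical helper (not part of the original snippet).
// Returns an array of word => count, most frequent first.
function top_word_counts($text, $limit = 100, array $ignore = array("the", "and", "for", "that", "with", "was", "said"))
{
    // Assumption of this sketch: lowercase everything so "The" and "the" match.
    $text = strtolower(strip_tags($text));

    // Split on anything that is not an ASCII letter or digit; drop empty pieces.
    $words = preg_split('/[^a-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    // Keep words of 3+ characters that are not on the ignore list.
    $words = array_filter($words, function ($w) use ($ignore) {
        return strlen($w) >= 3 && !in_array($w, $ignore, true);
    });

    // Count occurrences and sort by count, highest first.
    $counts = array_count_values($words);
    arsort($counts);

    // preserve_keys = true keeps purely numeric words as keys.
    return array_slice($counts, 0, $limit, true);
}

$sample = "The cat sat on the mat. The cat was happy, and the cat purred.";
// "cat" appears three times, so it ranks first.
print_r(top_word_counts($sample, 1));
```

Calling array_keys() on the result gives a plain keyword list, as in the original function. Words with equal counts may come back in any order on older PHP versions.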
