thinkphp自动采集怎么实现-小浪学习网

thinkphp实现自动采集功能的三种方法：

方法一：QueryList

个人感觉比较好用，采集详情比较不错的选择，但是采集复杂一点的列表，不好用。具体使用：

thinkphp自动采集怎么实现

控制器示例：

public function index(){     // 使用采集类     // 使用手册 ：http://www.php.cn/php/php-QueryList3-thinkphp.html     import('Org.QL.QueryList');     $url = "http://www.zyctd.com/gqqg/";     $reg = array();     $reg['title'] = array('.sulist_title','text');     $reg['shuliang'] = array('.su_li1','html');     $obj = new QueryList($url,$reg);     $data = $obj-&gt;jsonArr;     // foreach($data as $v){     //     echo "<br>".$v['title'].'___'.$v['shuliang']."<br>";     // }     p($data); }

相关推荐：《ThinkPHP教程》

方法二：simple_html_dom

这个方法比较适合采集一点结构简单的页面，HTML标签的类名比较明确的页面，还不错。具体使用：

thinkphp自动采集怎么实现

控制器示例：

public function index(){     // 参考文档：http://microphp.us/plugins/public/microphp_res/simple_html_dom/manual.htm#section_quickstart     // 下载地址：https://github.com/samacs/simple_html_dom/edit/master/simple_html_dom.php     // 使用方法：http://www.thinkphp.cn/topic/21635.html     import("Org.Util.simple_html_dom", '', '.php');     $html = file_get_html('http://www.zyctd.com/gqqg/');     $ret = $html-&gt;find('.supply_list_box ul',0)-&gt;first_child();     foreach($ret as $v){         echo $v;     }; }

方法三：获取页面HTMl，进行正则匹配采集

举例一个Demo：

采集一个页面：

http://www.zyctd.com/gqqg/

我要获取上面的四个信息：标题，数量，时间，跳转链接。

thinkphp自动采集怎么实现

获取这些信息，通过上面两种方法都采集不到，最后才选用的正则来采集。具体方法：

public function index(){     $url = "http://www.zyctd.com/gqqg/";     // http://www.zyctd.com/gqqg-p1.html     $supplyDB = M('supply');         $urlList = array();     $array = array();     for($x=1; $xgetInfo($v);         array_push($array,$curPageList);     };     foreach($array as $v){         foreach($v as $vv){             //echo $vv['title']."__".$vv['weight']."__".$vv['time']."<br>";             $data = array();             $data['title'] = $vv['title'];             $data['weight'] = $vv['weight'];             $data['add_time'] = $vv['add_time'];             $data['url'] = $vv['url'];             //$res = $supplyDB-&gt;add($data);             //echo $res;             echo "<p><span>".$vv['title']."</span>             <span>".$vv['weight']."</span>             <span>".$vv['add_time']."</span>             <span>".$vv['url']."</span></p>";         }     }         // 获取信息         //$curPageList = $this-&gt;getInfo($html);         //p($curPageList); } private function getInfo($url){     $html = $this-&gt;getHtml($url);     $array = array();     // 匹配所有的标题     preg_match_all("#<divclass><i></i><span>(.*?)</span>#",$html,$matches);     $all_title = $matches[1];     preg_match_all("#<i>发布时间：</i><span>(.*?)</span>#",$html,$matches);     // 匹配所有的发布时间     $all_time = $matches[1];     // 匹配所有的求购数量     preg_match_all("#<i>求购数量：</i><span>(.*?)</span>#",$html,$matches);     $all_weight = $matches[1];     // 匹配跳转链接     preg_match_all("#<atarget>#",$html,$matches);     $all_url = $matches[1];     // 组合     foreach($all_title as $k =&gt; $v){         $arr = array();         $arr['title'] = $v;         $arr['weight'] = $all_weight[$k];         $arr['add_time'] = $all_time[$k];         $arr['url'] = $all_url[$k];         array_push($array,$arr);     }     return $array; } private function getHtml($url){     $html = file_get_contents($url);     $html = preg_replace("# #","",$html);     $html = preg_replace("# #","",$html);     $html = preg_replace("#s#","",$html);     return $html; }</atarget></divclass>

以上就是

文章版权归作者所有，未经允许请勿转载。

THE END

喜欢就支持一下吧