| Package | Description |
|---|---|
| `us.codecraft.webmagic` | Main class "Spider" and models. |
| `us.codecraft.webmagic.downloader` | The Downloader downloads web pages and stores them in Page objects. |
| `us.codecraft.webmagic.scheduler` | The Scheduler handles URL management. |
| `us.codecraft.webmagic.scheduler.component` | Components of the scheduler. |
| `us.codecraft.webmagic.utils` | Static utilities of WebMagic. |

| Modifier and Type | Field and Description |
|---|---|
| `protected List<Request>` | `Spider.startRequests` |

| Modifier and Type | Method and Description |
|---|---|
| `Request` | `ResultItems.getRequest()` |
| `Request` | `Page.getRequest()`: Gets the request of the current page. |
| `Request` | `Request.putExtra(String key, Object value)` |
| `Request` | `Request.setPriority(long priority)`: Sets the priority of the request for sorting. |
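
Because `putExtra` and `setPriority` are listed as returning `Request`, they can presumably be chained. The sketch below illustrates that fluent style with a hypothetical stand-in class, `SketchRequest`; it is not WebMagic's `Request`, and it assumes each setter returns the same instance. The accessor names and the example URL are also assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class RequestChainingSketch {

    // Hypothetical stand-in for illustration only; not WebMagic's Request class.
    static final class SketchRequest {
        private final String url;
        private long priority;
        private final Map<String, Object> extras = new HashMap<>();

        SketchRequest(String url) {
            this.url = url;
        }

        // Mirrors the listed signature putExtra(String, Object); assumed to return this.
        SketchRequest putExtra(String key, Object value) {
            extras.put(key, value);
            return this;
        }

        // Mirrors the listed signature setPriority(long); assumed to return this.
        SketchRequest setPriority(long priority) {
            this.priority = priority;
            return this;
        }

        String getUrl() {
            return url;
        }

        long getPriority() {
            return priority;
        }

        Object getExtra(String key) {
            return extras.get(key);
        }
    }

    public static void main(String[] args) {
        // Build a request and attach metadata in a single chained expression.
        SketchRequest request = new SketchRequest("http://example.com/list")
                .putExtra("depth", 1)
                .setPriority(10L);
        // Prints: http://example.com/list priority=10 depth=1
        System.out.println(request.getUrl() + " priority=" + request.getPriority()
                + " depth=" + request.getExtra("depth"));
    }
}
```

Returning the receiver from each setter keeps request construction to one expression, which is convenient when requests are created inline while processing a page.
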

| Modifier and Type | Method and Description |
|---|---|
| `List<Request>` | `Site.getStartRequests()` |
| `List<Request>` | `Page.getTargetRequests()` |

| Modifier and Type | Method and Description |
|---|---|
| `Spider` | `Spider.addRequest(Request... requests)`: Adds URLs with additional information to crawl. |
| `Site` | `Site.addStartRequest(Request startRequest)`: Deprecated. |
| `void` | `Page.addTargetRequest(Request request)`: Adds a request to fetch. |
| `void` | `SpiderListener.onError(Request request)` |
| `protected void` | `Spider.onError(Request request)` |
| `void` | `SpiderListener.onSuccess(Request request)` |
| `protected void` | `Spider.onSuccess(Request request)` |
| `protected void` | `Spider.processRequest(Request request)` |
| `ResultItems` | `ResultItems.setRequest(Request request)` |
| `void` | `Page.setRequest(Request request)` |
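
The `onSuccess` and `onError` callbacks listed for `SpiderListener` suggest a simple observer hook per request. As a rough sketch of how such a listener might be used, the example below counts outcomes. `SketchRequest`, `SketchListener`, and `CountingListener` are hypothetical stand-ins defined here; only the two callback names come from the table, and how WebMagic actually registers or invokes listeners is not shown.

```java
import java.util.concurrent.atomic.AtomicInteger;

class CountingListenerSketch {

    // Hypothetical stand-in for a request; not WebMagic's Request class.
    static final class SketchRequest {
        final String url;

        SketchRequest(String url) {
            this.url = url;
        }
    }

    // Hypothetical mirror of the two callbacks listed for SpiderListener.
    interface SketchListener {
        void onSuccess(SketchRequest request);

        void onError(SketchRequest request);
    }

    // Counts outcomes; atomic counters keep it safe if callbacks come from several threads.
    static final class CountingListener implements SketchListener {
        private final AtomicInteger successes = new AtomicInteger();
        private final AtomicInteger errors = new AtomicInteger();

        @Override
        public void onSuccess(SketchRequest request) {
            successes.incrementAndGet();
        }

        @Override
        public void onError(SketchRequest request) {
            errors.incrementAndGet();
        }

        int successCount() {
            return successes.get();
        }

        int errorCount() {
            return errors.get();
        }
    }

    public static void main(String[] args) {
        CountingListener listener = new CountingListener();
        // Simulate a crawler reporting two successes and one failure.
        listener.onSuccess(new SketchRequest("http://example.com/a"));
        listener.onSuccess(new SketchRequest("http://example.com/b"));
        listener.onError(new SketchRequest("http://example.com/c"));
        // Prints: ok=2 failed=1
        System.out.println("ok=" + listener.successCount() + " failed=" + listener.errorCount());
    }
}
```
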

| Modifier and Type | Method and Description |
|---|---|
| `Spider` | `Spider.startRequest(List<Request> startRequests)`: Sets the start URLs of the Spider. |

| Modifier and Type | Method and Description |
|---|---|
| `protected Page` | `AbstractDownloader.addToCycleRetry(Request request, Site site)` |
| `Page` | `HttpClientDownloader.download(Request request, Task task)` |
| `Page` | `Downloader.download(Request request, Task task)`: Downloads a web page and stores it in a Page object. |
| `protected org.apache.http.client.methods.HttpUriRequest` | `HttpClientDownloader.getHttpUriRequest(Request request, Site site, Map<String,String> headers)` |
| `protected Page` | `HttpClientDownloader.handleResponse(Request request, String charset, org.apache.http.HttpResponse httpResponse, Task task)` |
| `protected void` | `AbstractDownloader.onError(Request request)` |
| `protected void` | `AbstractDownloader.onSuccess(Request request)` |
| `protected org.apache.http.client.methods.RequestBuilder` | `HttpClientDownloader.selectRequestMethod(Request request)` |

| Modifier and Type | Method and Description |
|---|---|
| `Request` | `Scheduler.poll(Task task)`: Gets a URL to crawl. |
| `Request` | `QueueScheduler.poll(Task task)` |
| `Request` | `PriorityScheduler.poll(Task task)` |

| Modifier and Type | Method and Description |
|---|---|
| `void` | `Scheduler.push(Request request, Task task)`: Adds a URL to fetch. |
| `void` | `DuplicateRemovedScheduler.push(Request request, Task task)` |
| `void` | `QueueScheduler.pushWhenNoDuplicate(Request request, Task task)` |
| `void` | `PriorityScheduler.pushWhenNoDuplicate(Request request, Task task)` |
| `protected void` | `DuplicateRemovedScheduler.pushWhenNoDuplicate(Request request, Task task)` |
| `protected boolean` | `DuplicateRemovedScheduler.shouldReserved(Request request)` |
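
The split between `push` and `pushWhenNoDuplicate` listed above reads like a template-method pattern: a base scheduler filters duplicates, and subclasses decide how accepted requests are stored. The following is a minimal sketch of that idea, not WebMagic's implementation. `SketchRequest` is a hypothetical stand-in, the `Task` parameter is omitted for brevity, and the ordering (larger priority values polled first) is an assumption.

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

class DedupPrioritySchedulerSketch {

    // Hypothetical stand-in for a request: just a URL and a priority.
    static final class SketchRequest {
        final String url;
        final long priority;

        SketchRequest(String url, long priority) {
            this.url = url;
            this.priority = priority;
        }
    }

    // URLs that have already been accepted once.
    private final Set<String> seenUrls = new HashSet<>();

    // Assumption: larger priority values are polled first.
    private final PriorityQueue<SketchRequest> queue = new PriorityQueue<>(
            Comparator.comparingLong((SketchRequest r) -> r.priority).reversed());

    // Entry point: drop duplicates, then hand the request to the storage hook.
    public synchronized void push(SketchRequest request) {
        if (seenUrls.add(request.url)) {
            pushWhenNoDuplicate(request);
        }
    }

    // Storage hook; a FIFO variant would append to a plain queue here instead.
    protected void pushWhenNoDuplicate(SketchRequest request) {
        queue.add(request);
    }

    // Returns the next request to crawl, or null when nothing is queued.
    public synchronized SketchRequest poll() {
        return queue.poll();
    }

    public static void main(String[] args) {
        DedupPrioritySchedulerSketch scheduler = new DedupPrioritySchedulerSketch();
        scheduler.push(new SketchRequest("http://example.com/a", 1));
        scheduler.push(new SketchRequest("http://example.com/b", 5));
        scheduler.push(new SketchRequest("http://example.com/a", 9)); // duplicate URL, ignored

        System.out.println(scheduler.poll().url); // http://example.com/b
        System.out.println(scheduler.poll().url); // http://example.com/a
        System.out.println(scheduler.poll());     // null
    }
}
```

Keeping the duplicate check in one place means each storage strategy only has to implement the hook, which matches the way both `QueueScheduler` and `PriorityScheduler` appear in the table with their own `pushWhenNoDuplicate`.
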

| Modifier and Type | Method and Description |
|---|---|
| `protected String` | `HashSetDuplicateRemover.getUrl(Request request)` |
| `protected String` | `BloomFilterDuplicateRemover.getUrl(Request request)` |
| `boolean` | `HashSetDuplicateRemover.isDuplicate(Request request, Task task)` |
| `boolean` | `DuplicateRemover.isDuplicate(Request request, Task task)`: Checks whether the request is a duplicate. |
| `boolean` | `BloomFilterDuplicateRemover.isDuplicate(Request request, Task task)` |
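
The name `BloomFilterDuplicateRemover` points at the classic trade-off between an exact set and a Bloom filter: the filter uses a fixed amount of memory but may occasionally report an unseen URL as a duplicate. The sketch below shows the general technique with `java.util.BitSet` and double hashing. It is an illustration only and makes no claim about how WebMagic implements its remover; the class name, constructor parameters, and hash mixing are all assumptions.

```java
import java.util.BitSet;

class BloomDuplicateRemoverSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomDuplicateRemoverSketch(int size, int hashCount) {
        if (size < 1 || hashCount < 1) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Returns true if the URL was probably seen before; otherwise records it and returns false.
    // False positives are possible; false negatives are not.
    public synchronized boolean isDuplicate(String url) {
        int h1 = url.hashCode();
        int h2 = secondHash(url);
        boolean allSet = true;
        for (int i = 0; i < hashCount; i++) {
            // Double hashing: derive the i-th probe position from two base hashes.
            int index = Math.floorMod(h1 + i * h2, size);
            if (!bits.get(index)) {
                allSet = false;
                bits.set(index);
            }
        }
        return allSet;
    }

    // FNV-1a style mix over the string's chars, used as the second base hash.
    private static int secondHash(String s) {
        int h = 0x811c9dc5;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 16777619;
        }
        return h | 1; // force an odd step so it is never zero
    }

    public static void main(String[] args) {
        BloomDuplicateRemoverSketch remover = new BloomDuplicateRemoverSketch(1 << 16, 3);
        System.out.println(remover.isDuplicate("http://example.com/a")); // false
        System.out.println(remover.isDuplicate("http://example.com/a")); // true
    }
}
```

An exact remover would replace the bit set with a `HashSet<String>` and return `!seen.add(url)`, trading more memory for zero false positives.
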

| Modifier and Type | Method and Description |
|---|---|
| `static List<Request>` | `UrlUtils.convertToRequests(Collection<String> urls)` |

| Modifier and Type | Method and Description |
|---|---|
| `static List<String>` | `UrlUtils.convertToUrls(Collection<Request> requests)` |
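
Judging by their signatures, `convertToRequests` and `convertToUrls` are inverse conversions between URL strings and request objects. A minimal sketch of that round trip follows, assuming a one-to-one, order-preserving mapping. `SketchRequest` is a hypothetical stand-in, and the two static methods here only mirror the listed names; they are not WebMagic's `UrlUtils`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

class UrlConversionSketch {

    // Hypothetical stand-in for a request that only carries a URL.
    static final class SketchRequest {
        final String url;

        SketchRequest(String url) {
            this.url = url;
        }
    }

    // Wraps each URL in a request, preserving iteration order.
    static List<SketchRequest> convertToRequests(Collection<String> urls) {
        List<SketchRequest> requests = new ArrayList<>(urls.size());
        for (String url : urls) {
            requests.add(new SketchRequest(url));
        }
        return requests;
    }

    // Extracts the URL from each request, preserving iteration order.
    static List<String> convertToUrls(Collection<SketchRequest> requests) {
        List<String> urls = new ArrayList<>(requests.size());
        for (SketchRequest request : requests) {
            urls.add(request.url);
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("http://example.com/a", "http://example.com/b");
        List<String> roundTrip = convertToUrls(convertToRequests(urls));
        // Prints: [http://example.com/a, http://example.com/b]
        System.out.println(roundTrip);
    }
}
```
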
Copyright © 2016. All rights reserved.