htmlquery
====
[![Build Status](https://travis-ci.org/antchfx/htmlquery.svg?branch=master)](https://travis-ci.org/antchfx/htmlquery)
[![Coverage Status](https://coveralls.io/repos/github/antchfx/htmlquery/badge.svg?branch=master)](https://coveralls.io/github/antchfx/htmlquery?branch=master)
[![GoDoc](https://godoc.org/github.com/antchfx/htmlquery?status.svg)](https://godoc.org/github.com/antchfx/htmlquery)
[![Go Report Card](https://goreportcard.com/badge/github.com/antchfx/htmlquery)](https://goreportcard.com/report/github.com/antchfx/htmlquery)

Overview
====
`htmlquery` is an XPath query package for HTML. It lets you extract data from HTML documents, or evaluate expressions against them, using XPath.

`htmlquery` has a built-in query object cache based on [LRU](https://godoc.org/github.com/golang/groupcache/lru) that stores recently used XPath query strings. With query caching enabled, an XPath expression is not recompiled on every query.

Installation
====
```
go get github.com/antchfx/htmlquery
```
Getting Started
====
#### Query: returns the matched elements or an error.
```go
nodes, err := htmlquery.QueryAll(doc, "//a")
if err != nil {
	panic(`not a valid XPath expression.`)
}
```
#### Load HTML document from URL.
```go
doc, err := htmlquery.LoadURL("http://example.com/")
```
#### Load HTML document from file.
```go
filePath := "/home/user/sample.html"
doc, err := htmlquery.LoadDoc(filePath)
```
#### Load HTML document from string.
```go
s := `<html>....</html>`
doc, err := htmlquery.Parse(strings.NewReader(s))
```
#### Find all A elements.
```go
list := htmlquery.Find(doc, "//a")
```
#### Find all A elements that have `href` attribute.
```go
list := htmlquery.Find(doc, "//a[@href]")
```
#### Find all A elements with `href` attribute and only return `href` value.
```go
list := htmlquery.Find(doc, "//a/@href")
for _, n := range list {
	fmt.Println(htmlquery.SelectAttr(n, "href")) // output @href value
}
```
#### Find the third A element.
```go
a := htmlquery.FindOne(doc, "//a[3]")
```
#### Find the child element (img) under an A element and print its source.
```go
a := htmlquery.FindOne(doc, "//a")
img := htmlquery.FindOne(a, "//img")
fmt.Println(htmlquery.SelectAttr(img, "src")) // output @src value
```
#### Count all IMG elements.
```go
expr, _ := xpath.Compile("count(//img)")
v := expr.Evaluate(htmlquery.CreateXPathNavigator(doc)).(float64)
fmt.Printf("total count is %f", v)
```
FAQ
====
#### `Find()` vs `QueryAll()`, which is better?
`Find` and `QueryAll` do the same thing: both return all matching HTML nodes.
`Find` panics if you pass it an invalid XPath query, whereas `QueryAll` returns an error instead.
#### Can I save my query expression object for the next query?
Yes, you can. The `QuerySelector` and `QuerySelectorAll` methods accept a query expression object.
Caching or reusing a query expression object avoids recompiling the XPath expression and improves query performance.
#### XPath query object cache performance
```
goos: windows
goarch: amd64
pkg: github.com/antchfx/htmlquery
BenchmarkSelectorCache-4 20000000 55.2 ns/op
BenchmarkDisableSelectorCache-4 500000 3162 ns/op
```
#### How to disable caching?
```go
htmlquery.DisableSelectorCache = true
```
Changelogs
===
2019-11-19
- Add a built-in query object cache to avoid recompiling the same query string. [#16](https://github.com/antchfx/htmlquery/issues/16)
- Add `LoadDoc`. [#18](https://github.com/antchfx/htmlquery/pull/18)

2019-10-05
- Add `QueryAll` and `Query`, which return an error for an invalid XPath expression.
- Add `QuerySelector` and `QuerySelectorAll`, which support reusing a query object.

2019-02-04
- [#7](https://github.com/antchfx/htmlquery/issues/7) Remove the deprecated `FindEach()` and `FindEachWithBreak()` methods.

2018-12-28
- Avoid adding duplicate elements to the list returned by `Find()`. [#6](https://github.com/antchfx/htmlquery/issues/6)

Tutorial
===
```go
package main

import (
	"fmt"

	"github.com/antchfx/htmlquery"
)

func main() {
	doc, err := htmlquery.LoadURL("https://www.bing.com/search?q=golang")
	if err != nil {
		panic(err)
	}
	// Find all news items.
	list, err := htmlquery.QueryAll(doc, "//ol/li")
	if err != nil {
		panic(err)
	}
	for i, n := range list {
		a := htmlquery.FindOne(n, "//a")
		if a == nil {
			continue // FindOne found no link, so there is nothing to print
		}
		fmt.Printf("%d %s(%s)\n", i, htmlquery.InnerText(a), htmlquery.SelectAttr(a, "href"))
	}
}
```
List of supported XPath query packages
===
| Name                                              | Description                               |
| ------------------------------------------------- | ----------------------------------------- |
| [htmlquery](https://github.com/antchfx/htmlquery) | XPath query package for the HTML document |
| [xmlquery](https://github.com/antchfx/xmlquery)   | XPath query package for the XML document  |
| [jsonquery](https://github.com/antchfx/jsonquery) | XPath query package for the JSON document |

Questions
===
Please let me know if you have any questions.