11 KiB
godirwalk
godirwalk
is a library for traversing a directory tree on a file
system.
In short, why do I use this library?
- It's faster than
filepath.Walk
. - It's more correct on Windows than
filepath.Walk
. - It's more easy to use than
filepath.Walk
. - It's more flexible than
filepath.Walk
.
Usage Example
Additional examples are provided in the examples/
subdirectory.
This library will normalize the provided top level directory name
based on the os-specific path separator by calling filepath.Clean
on
its first argument. However it always provides the pathname created by
using the correct os-specific path separator when invoking the
provided callback function.
dirname := "some/directory/root"
err := godirwalk.Walk(dirname, &godirwalk.Options{
Callback: func(osPathname string, de *godirwalk.Dirent) error {
fmt.Printf("%s %s\n", de.ModeType(), osPathname)
return nil
},
Unsorted: true, // (optional) set true for faster yet non-deterministic enumeration (see godoc)
})
This library not only provides functions for traversing a file system
directory tree, but also for obtaining a list of immediate descendants
of a particular directory, typically much more quickly than using
os.ReadDir
or os.ReadDirnames
.
Description
Here's why I use godirwalk
in preference to filepath.Walk
,
os.ReadDir
, and os.ReadDirnames
.
It's faster than filepath.Walk
When compared against filepath.Walk
in benchmarks, it has been
observed to run between five and ten times the speed on darwin, at
speeds comparable to the that of the unix find
utility; about twice
the speed on linux; and about four times the speed on Windows.
How does it obtain this performance boost? It does less work to give
you nearly the same output. This library calls the same syscall
functions to do the work, but it makes fewer calls, does not throw
away information that it might need, and creates less memory churn
along the way by reusing the same scratch buffer for reading from a
directory rather than reallocating a new buffer every time it reads
file system entry data from the operating system.
While traversing a file system directory tree, filepath.Walk
obtains
the list of immediate descendants of a directory, and throws away the
file system node type information provided by the operating system
that comes with the node's name. Then, immediately prior to invoking
the callback function, filepath.Walk
invokes os.Stat
for each
node, and passes the returned os.FileInfo
information to the
callback.
While the os.FileInfo
information provided by os.Stat
is extremely
helpful--and even includes the os.FileMode
data--providing it
requires an additional system call for each node.
Because most callbacks only care about what the node type is, this
library does not throw the type information away, but rather provides
that information to the callback function in the form of a
os.FileMode
value. Note that the provided os.FileMode
value that
this library provides only has the node type information, and does not
have the permission bits, sticky bits, or other information from the
file's mode. If the callback does care about a particular node's
entire os.FileInfo
data structure, the callback can easiy invoke
os.Stat
when needed, and only when needed.
Benchmarks
macOS
$ go test -bench=. -benchmem
goos: darwin
goarch: amd64
pkg: github.com/karrick/godirwalk
BenchmarkReadDirnamesStandardLibrary-12 50000 26250 ns/op 10360 B/op 16 allocs/op
BenchmarkReadDirnamesThisLibrary-12 50000 24372 ns/op 5064 B/op 20 allocs/op
BenchmarkFilepathWalk-12 1 1099524875 ns/op 228415912 B/op 416952 allocs/op
BenchmarkGodirwalk-12 2 526754589 ns/op 103110464 B/op 451442 allocs/op
BenchmarkGodirwalkUnsorted-12 3 509219296 ns/op 100751400 B/op 378800 allocs/op
BenchmarkFlameGraphFilepathWalk-12 1 7478618820 ns/op 2284138176 B/op 4169453 allocs/op
BenchmarkFlameGraphGodirwalk-12 1 4977264058 ns/op 1031105328 B/op 4514423 allocs/op
PASS
ok github.com/karrick/godirwalk 21.219s
Linux
$ go test -bench=. -benchmem
goos: linux
goarch: amd64
pkg: github.com/karrick/godirwalk
BenchmarkReadDirnamesStandardLibrary-12 100000 15458 ns/op 10360 B/op 16 allocs/op
BenchmarkReadDirnamesThisLibrary-12 100000 14646 ns/op 5064 B/op 20 allocs/op
BenchmarkFilepathWalk-12 2 631034745 ns/op 228210216 B/op 416939 allocs/op
BenchmarkGodirwalk-12 3 358714883 ns/op 102988664 B/op 451437 allocs/op
BenchmarkGodirwalkUnsorted-12 3 355363915 ns/op 100629234 B/op 378796 allocs/op
BenchmarkFlameGraphFilepathWalk-12 1 6086913991 ns/op 2282104720 B/op 4169417 allocs/op
BenchmarkFlameGraphGodirwalk-12 1 3456398824 ns/op 1029886400 B/op 4514373 allocs/op
PASS
ok github.com/karrick/godirwalk 19.179s
It's more correct on Windows than filepath.Walk
I did not previously care about this either, but humor me. We all love how we can write once and run everywhere. It is essential for the language's adoption, growth, and success, that the software we create can run unmodified on all architectures and operating systems supported by Go.
When the traversed file system has a logical loop caused by symbolic
links to directories, on unix filepath.Walk
ignores symbolic links
and traverses the entire directory tree without error. On Windows
however, filepath.Walk
will continue following directory symbolic
links, even though it is not supposed to, eventually causing
filepath.Walk
to terminate early and return an error when the
pathname gets too long from concatenating endless loops of symbolic
links onto the pathname. This error comes from Windows, passes through
filepath.Walk
, and to the upstream client running filepath.Walk
.
The takeaway is that behavior is different based on which platform
filepath.Walk
is running. While this is clearly not intentional,
until it is fixed in the standard library, it presents a compatibility
problem.
This library correctly identifies symbolic links that point to
directories and will only follow them when FollowSymbolicLinks
is
set to true. Behavior on Windows and other operating systems is
identical.
It's more easy to use than filepath.Walk
Since this library does not invoke os.Stat
on every file system node
it encounters, there is no possible error event for the callback
function to filter on. The third argument in the filepath.WalkFunc
function signature to pass the error from os.Stat
to the callback
function is no longer necessary, and thus eliminated from signature of
the callback function from this library.
Also, filepath.Walk
invokes the callback function with a solidus
delimited pathname regardless of the os-specific path separator. This
library invokes the callback function with the os-specific pathname
separator, obviating a call to filepath.Clean
in the callback
function for each node prior to actually using the provided pathname.
In other words, even on Windows, filepath.Walk
will invoke the
callback with some/path/to/foo.txt
, requiring well written clients
to perform pathname normalization for every file prior to working with
the specified file. In truth, many clients developed on unix and not
tested on Windows neglect this subtlety, and will result in software
bugs when running on Windows. This library would invoke the callback
function with some\path\to\foo.txt
for the same file when running on
Windows, eliminating the need to normalize the pathname by the client,
and lessen the likelyhood that a client will work on unix but not on
Windows.
It's more flexible than filepath.Walk
Configurable Handling of Symbolic Links
The default behavior of this library is to ignore symbolic links to
directories when walking a directory tree, just like filepath.Walk
does. However, it does invoke the callback function with each node it
finds, including symbolic links. If a particular use case exists to
follow symbolic links when traversing a directory tree, this library
can be invoked in manner to do so, by setting the
FollowSymbolicLinks
parameter to true.
Configurable Sorting of Directory Children
The default behavior of this library is to always sort the immediate
descendants of a directory prior to visiting each node, just like
filepath.Walk
does. This is usually the desired behavior. However,
this does come at slight performance and memory penalties required to
sort the names when a directory node has many entries. Additionally if
caller specifies Unsorted
enumeration, reading directories is lazily
performed as the caller consumes entries. If a particular use case
exists that does not require sorting the directory's immediate
descendants prior to visiting its nodes, this library will skip the
sorting step when the Unsorted
parameter is set to true.
Here's an interesting read of the potential hazzards of traversing a
file system hierarchy in a non-deterministic order. If you know the
problem you are solving is not affected by the order files are
visited, then I encourage you to use Unsorted
. Otherwise skip
setting this option.
Researchers find bug in Python script may have affected hundreds of studies
Configurable Post Children Callback
This library provides upstream code with the ability to specify a
callback to be invoked for each directory after its children are
processed. This has been used to recursively delete empty directories
after traversing the file system in a more efficient manner. See the
examples/clean-empties
directory for an example of this usage.
Configurable Error Callback
This library provides upstream code with the ability to specify a
callback to be invoked for errors that the operating system returns,
allowing the upstream code to determine the next course of action to
take, whether to halt walking the hierarchy, as it would do were no
error callback provided, or skip the node that caused the error. See
the examples/walk-fast
directory for an example of this usage.