Replies: 1 comment 1 reply
-
|
@moorereason @bep would love your sanity check on this, especially want to make sure that it works for Hugo. |
Beta Was this translation helpful? Give feedback.
1 reply
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
This document outlines the design and behavior changes go-toml v2's unmarshaler, as implemented in the v2-wip branch.
Behavior changes
Errors contain human readable context
When a parsing or unmarshaling error occurs,
Unmarshalreturns atoml.DecodeError, which exposes two methods:Error()-- to satisfy theerrorinterface. It returns the error message itself. For example: "invalid date-time timezone".String()-- to satisfy thefmt.Stringerinterface. It returns a contextualized, human readable version of the error. For example:Automatic field name guessing
When a unmarshaling to a struct, if a key in the TOML document does not exactly match the name of a struct field or any of the
toml-tagged field, v1 tries multiple variations of the key (code).This adds complexity in the Unmarshal code, resulting in potentially more allocations. The same result can effectively be achieved with setting the appropriate
tomltags to the fields that don't exactly match the expected key in the document.Stdlib's encoding/json similarily does not try to guess field names. It checks for either an exact match or a case-insensitive match. It provides the
jsontag to pick a different field that wouldn't follow this rule. This unmarshaler followsencoding/json's method.Ignore pre-existing value in interface
When decoding into a non-nil
interface{}, go-toml v1 uses the type of the element in the interface to decode the object. For example:In this case, field
Ais of typeinterface{}, containing ainnerstruct. go-toml v1 sees that type and uses it when decoding the object.Other libraries such as
encoding/jsonand BurntSushi's toml have a different behavior. When decoding an object into aninterface{}, they instead decide to disregard whatever value theinterface{}may contain and replace it with amap[string]interface{}. With the same datastructure as above, here is what the result looks like:I believe there are two motivations for that behavior. First, when the value is in an
interface{}it is no addressable, so it needs to be rebuilt anyway. Second, it is to avoid the output types to change depending on the input (in this example, the program would have to deal withAcontaining either aninnerstruct or amap[string]interface{}depending on whether in put has anilelement in the interface or not).go-toml v2's unmarshaler changes the behavior to match
encoding/json's.Example code: https://play.golang.org/p/VxkwavVlqo1
Values out of array bounds ignored
When decoding into an array, go-toml v1 returns an error when the number of elements contained in the doc is superior to the capacity of the array. For example:
In the same situation,
encoding/jsonignores the last value:go-toml v2 follows
encoding/json's behavior, on the grounds that providing sticking to the standard library's bevhior reduces surprises when using go-toml. But I can easily be convinced this is a mistake.Support for
toml.Unmarshalerhas been droppedThis method does not seem to be wildly used. This sourcegraph query shows that the main public user of this is influxdata. Apparently this method is being used to parse custom units without having to put them in strings. For example:
duration = 1sis accepted. A similar effect can be achieved using theencoding.TextUnmarshalerand keeping the units in strings (note: their code actually accepts both the in-string unquoted versions). Of course not perfect, but we don't have more info on usage.The flexible structure of TOML makes this method difficult to define: what bytes should be passed to the
UnmarshalTOMLmethod? This question needs to be answered because the canonical signature of this method does not allow to indicate how many bytes were parsed. Departing from this signature would go significantly against the goal of staying close to the behavior ofencoding/json.To avoid unnecessary complexity this feature has been left out. If there is an ask to bring this back we can reconsider this decision.
Implementation notes
The new parser does one pass over the input bytes to construct an AST. It does not however attempt any parsing or validation of values, which are left as byte slices pointing into the input data. As a result, the unmarshaler chooses to decode each value based on what structure it is building (and entirely skips whole branches of the AST when they do not exist in the target structure).
This two-passes design is reasonably performant, and allows to reuse the parsing structure in both the Unmarshaler and the Document interface. Initially I attempted a one-pass version, but it required to create set up and synchronize two stacks (the parser's and the unmarshaler's), which made the code difficult to follow and debug, especially when trying to skip over unused parts of the document.
As an optimization, the parser only parses one top level expression at a time. Coupled with the fact that the AST is stored in a single array, it allows to reuse storage from one expression to the next, maintaining allocations to a minimum.
When decoding into struct, reflection is used once per type to figure out fields and their acceptable keys. That information is cached in a global concurrent-safe map, which minimizes the overhead to reflecting on structs and should allow the cache to converge to all the types used by the program.
Verification of key uniqueness and types is implemented as a separate pass on the AST. This method decouples the TOML-specific semantic check from building the target structure. It can then be reused efficiently by the Document builder.
Beta Was this translation helpful? Give feedback.
All reactions