Output logs parsing
Most of the details about how log output files are parsed are explained in the Outputs parsing section of the user guide. Here we provide more information about the objects involved and the constraints to follow when adding the parsing of new quantities, or of new types of output, to the current implementation.
Let us first take a closer look at the turbomoleio.output.parser.Parser object. The Parser
takes the string of a TURBOMOLE log file as input. It is then used to extract
the data required by each Data or File object. The Parser is made of several
parsing methods, each of which is in charge of parsing a specific portion of the log file.
Some of the parsing methods are very specific to one type of calculation (e.g. only relevant for an escf
output, or for a statpt output), while others are common to several of them (e.g.
turbomoleio.output.parser.Parser.basis() and turbomoleio.output.parser.Parser.header()).
Each parsing method returns a dictionary with the parsed data, or None if the section to be parsed
could not be found in the string. This dictionary is meant to be used by the Data and File objects
during their instantiation.
The parsing methods are implemented as lazy properties, meaning that their value is computed only
once, on first access, using the lazy_property decorator available in the monty package.
This is advantageous since some sections contain mixed information, so the same property
may be accessed more than once while building a File object. The data is thus stored temporarily
in the Parser instance, but this should not be a problem, because the extracted information
is relatively small and the parser instance is meant to be discarded once the object
has been generated.
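The caching behaviour described above can be illustrated with a minimal sketch. It assumes a hypothetical MiniParser class and an invented "host:" log line, and it uses functools.cached_property from the standard library as a stand-in for monty's lazy_property, so it is not the turbomoleio implementation.

```python
from functools import cached_property  # stdlib stand-in for monty's lazy_property


class MiniParser:
    """Hypothetical, simplified parser; not the turbomoleio implementation."""

    def __init__(self, string):
        self.string = string
        self.n_calls = 0  # only here to show how often the parsing code runs

    @cached_property
    def header(self):
        # Executed on first access only; the result is then cached on the instance.
        self.n_calls += 1
        for line in self.string.splitlines():
            if line.startswith("host:"):  # invented log format
                return {"host": line.split(":", 1)[1].strip()}
        return None  # section not found


parser = MiniParser("host: node01\nsome other output")
first = parser.header
second = parser.header  # served from the cache, no second parsing pass
print(first, parser.n_calls)
```

Accessing the property twice runs the parsing code once, and the cached dictionary lives only as long as the parser instance does.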
In general, when possible, the parsing methods first narrow down the section of the text that contains the information to be extracted and then work on this reduced text to extract the exact data needed. This is done because the outputs of TURBOMOLE do not have an organized structure, and narrowing down the text makes it easier to target specific lines of the output.
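As an illustration of this two-step strategy, the following sketch first isolates a block with one regular expression and then extracts the values from that block only. The log excerpt, the delimiters and the parse_energies function are invented for the example and do not reproduce the real TURBOMOLE format or the actual Parser code.

```python
import re

# Invented log excerpt; real TURBOMOLE output is formatted differently.
LOG = """\
 ... preceding output ...
 +---- energies ----+
 | total energy   =    -76.3432 |
 | kinetic energy =     75.9012 |
 +------------------+
 total energy = 0.0 (unrelated line elsewhere in the log)
"""


def parse_energies(string):
    # Step 1: narrow down to the block between the two delimiter lines.
    block = re.search(r"\+-+ energies -+\+(.*?)\+-+\+", string, re.DOTALL)
    if block is None:
        return None  # section not present in this output
    # Step 2: extract the values from the reduced text only.
    values = re.findall(r"\|\s*(\w+) energy\s*=\s*(-?\d+\.\d+)", block.group(1))
    return {f"{name}_energy": float(value) for name, value in values}


energies = parse_energies(LOG)
print(energies)
```

Because the second expression only sees the isolated block, the unrelated "total energy" line further down cannot be matched by mistake.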
The information parsed with the Parser is meant to be used by the Data objects to create
an instance. In particular, all the Data and File objects are subclasses of the
turbomoleio.output.data.BaseData abstract class. This has a single abstract method,
from_parser, that should be implemented by subclasses and should be able to create a new instance
of the Data or File object from a Parser instance. The BaseData class provides a
from_string and a from_file method that rely on from_parser to create an instance of
the subclass from the string of the output or from the file.
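The relationship between the three constructors can be sketched as follows. SketchParser, SketchBaseData and LineCountData are hypothetical names introduced for this example; the code only mirrors the pattern described above and is not the actual BaseData source.

```python
import abc
import os
import tempfile


class SketchParser:
    """Hypothetical stand-in for the Parser: it only stores the log string."""

    def __init__(self, string):
        self.string = string


class SketchBaseData(abc.ABC):
    """Simplified illustration of the pattern, not the actual BaseData code."""

    @classmethod
    @abc.abstractmethod
    def from_parser(cls, parser):
        """Build an instance from a parser; subclasses must implement this."""

    @classmethod
    def from_string(cls, string):
        # Wrap the string in a parser and delegate to from_parser.
        return cls.from_parser(SketchParser(string))

    @classmethod
    def from_file(cls, filepath):
        # Read the file and delegate to from_string.
        with open(filepath) as f:
            return cls.from_string(f.read())


class LineCountData(SketchBaseData):
    """Toy subclass: the only 'parsed' quantity is the number of lines."""

    def __init__(self, n_lines=None):
        self.n_lines = n_lines

    @classmethod
    def from_parser(cls, parser):
        return cls(n_lines=len(parser.string.splitlines()))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.log")
    with open(path, "w") as f:
        f.write("line 1\nline 2\n")
    data = LineCountData.from_file(path)
```

A subclass only has to provide from_parser; from_string and from_file are inherited unchanged.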
The Data objects are meant to contain data that can be grouped together by meaning or by type.
In its from_parser method, a Data object can use the information coming from a single property
of the Parser or from more than one. The File objects, on the other hand, preferably do not
access the properties of the Parser directly in from_parser, but instead create instances
of Data objects and store mainly Data instances as attributes.
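The division of labour between Data and File objects might look like the sketch below. All class names, the "key = value" log format and the choice of returning None when a section is missing are assumptions made for this example, not a description of the existing turbomoleio classes.

```python
class SketchParser:
    """Hypothetical parser for an invented 'key = value' log format."""

    def __init__(self, string):
        self.string = string

    def _find(self, key):
        for line in self.string.splitlines():
            name, _, value = line.partition("=")
            if name.strip() == key:
                return value.strip()
        return None

    @property
    def energy(self):
        value = self._find("energy")
        return None if value is None else {"total_energy": float(value)}

    @property
    def timings(self):
        value = self._find("cpu_time")
        return None if value is None else {"cpu_time": float(value)}


class EnergyData:
    def __init__(self, total_energy=None):
        self.total_energy = total_energy

    @classmethod
    def from_parser(cls, parser):
        # A Data object reads the parser properties directly.
        parsed = parser.energy
        return None if parsed is None else cls(total_energy=parsed["total_energy"])


class TimingData:
    def __init__(self, cpu_time=None):
        self.cpu_time = cpu_time

    @classmethod
    def from_parser(cls, parser):
        parsed = parser.timings
        return None if parsed is None else cls(cpu_time=parsed["cpu_time"])


class SketchOutput:
    """File-like object: stores Data instances rather than raw parser dicts."""

    def __init__(self, energy=None, timings=None):
        self.energy = energy
        self.timings = timings

    @classmethod
    def from_parser(cls, parser):
        # No direct access to parser properties here, only to Data objects.
        return cls(
            energy=EnergyData.from_parser(parser),
            timings=TimingData.from_parser(parser),
        )


output = SketchOutput.from_parser(SketchParser("energy = -76.3432\ncpu_time = 12.5"))
```

With this layout a user interested in a single quantity can build only EnergyData, while SketchOutput collects everything in one place.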
From a design point of view, this approach has been chosen because it makes it possible to:
share parsing functions among different types of calculations,
quickly instantiate Data objects based on a single portion of the output, e.g. a user who only wants information about the energy can obtain it by running ScfEnergiesData.from_file("ridft.log") without the cost of parsing the rest of the file,
update the parsing of a specific section for new versions of TURBOMOLE without impacting the other parsing methods. This is possible since the parsing methods for the different sections are isolated.
Parsing a new quantity
If you want to start parsing a new quantity for one of the types of log files that are already
handled, you should proceed from the bottom up. First check whether the quantity that you want to parse
is part of a section that is already parsed by some method of the Parser. If that is the case
you can probably modify that method; otherwise you should create a new property, mark it with
the lazy_property decorator and implement the parsing inside it. A good approach is
to use regular expressions both to narrow down the section that you want to focus on and to
extract the exact values that you want, but for small sections analyzing the lines one by one
is also acceptable.
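For a small section, the line-by-line alternative could look like this sketch. The convergence property, the "convergence criteria" marker and the log text are hypothetical, and functools.cached_property is again used in place of monty's lazy_property to keep the example free of third-party imports.

```python
from functools import cached_property  # stand-in for monty's lazy_property


class SketchParser:
    """Hypothetical parser with one newly added property."""

    def __init__(self, string):
        self.string = string

    @cached_property
    def convergence(self):
        lines = self.string.splitlines()
        # Locate the line that opens the (invented) section.
        start = next(
            (i for i, line in enumerate(lines) if "convergence criteria" in line),
            None,
        )
        if start is None:
            return None  # section not found
        data = {}
        for line in lines[start + 1:]:
            key, sep, value = line.partition(":")
            if not sep:
                break  # the first line without "key : value" ends the section
            data[key.strip()] = float(value)
        return data


LOG = (
    "header\n"
    "convergence criteria\n"
    " energy   : 1.0E-06\n"
    " gradient : 1.0E-03\n"
    "\n"
    " energy : 99\n"
)
parsed = SketchParser(LOG).convergence
```

The loop stops at the blank line, so the later "energy : 99" line outside the section is ignored.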
Note
When writing the parser always consider that the output of TURBOMOLE can change considerably depending on the different options provided (e.g. an entire section or single values might appear/disappear). Always check that your parser is working under different conditions.
Once the Parser has been modified you should either update one (or more) of the Data objects
or write a new one, depending on the type of information that you extracted. In the latter case,
subclass the BaseData class and retrieve your newly parsed information in the from_parser
method.
Finally, if a new Data object has been created, modify the appropriate File objects that should contain it.
Warning
When adding new attributes to Data and File objects always set None as the default
in the __init__. In this way data that have been JSON serialized with previous versions
of the code will still be deserialized correctly (only your newly added attribute will
be missing, i.e. set to None). If instead you change or remove one of the existing attributes,
the old data can no longer be deserialized. Backward-incompatible changes such as these should
be thought through carefully and agreed upon by the community of users.
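The effect of the None default can be seen in a small sketch. SketchData and its plain as_dict/from_dict pair are assumptions standing in for the real serialization machinery; the point is only that keyword arguments missing from old data fall back to their defaults, while unknown keys fail.

```python
import json


class SketchData:
    """Hypothetical Data object; new_quantity was added in a later version."""

    def __init__(self, total_energy=None, new_quantity=None):
        self.total_energy = total_energy
        self.new_quantity = new_quantity

    def as_dict(self):
        return {"total_energy": self.total_energy, "new_quantity": self.new_quantity}

    @classmethod
    def from_dict(cls, d):
        # Keys are passed as keyword arguments, so missing ones use the defaults.
        return cls(**d)


# JSON written by an older version that did not know about new_quantity.
old_json = '{"total_energy": -76.34}'
restored = SketchData.from_dict(json.loads(old_json))

# JSON containing an attribute that has since been removed or renamed.
try:
    SketchData.from_dict({"total_energy": -76.34, "removed_attribute": 1})
    removed_fails = False
except TypeError:
    removed_fails = True
```

Old data loads with new_quantity set to None, whereas data carrying an attribute unknown to the current __init__ raises a TypeError in this sketch.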
Parsing a new type of log
When parsing the output of an executable that is not supported by the current
version of turbomoleio you should proceed in a similar manner as described in
Parsing a new quantity. In this case you will probably need to add several
new methods to the Parser, addressing the different quantities and information that
you want to extract. Also check which of the existing properties work for
your kind of output, since some of them are likely to be compatible (e.g. the header
attribute will very likely be equivalent in your case as well).
After creating the parser methods, you should encapsulate that information inside Data objects and create a new File object where you will store all the extracted data. The same recommendations given in the previous section hold here as well.
Lastly, if this is suitable, add your object to the turbomoleio.output.files.exec_to_out_obj
dictionary. This will be used as a reference to decide which File object to use when
running a specific executable. In particular it will be used in the unit tests, as
explained below.
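A registry of this kind can be pictured with the following sketch. The dictionary name, the classes and the "newexec" key are all hypothetical, and the structure of the real exec_to_out_obj dictionary (for instance, whether it stores classes or names) may differ.

```python
class RidftLikeOutput:
    """Hypothetical File object for an already supported executable."""


class NewExecOutput:
    """Hypothetical File object for a newly supported executable."""


# Hypothetical registry in the spirit of exec_to_out_obj; contents are invented.
sketch_exec_to_out_obj = {"ridft": RidftLikeOutput}

# Registering the new File object under the name of its executable.
sketch_exec_to_out_obj["newexec"] = NewExecOutput


def output_class_for(executable):
    """Return the File class registered for an executable name."""
    try:
        return sketch_exec_to_out_obj[executable]
    except KeyError:
        raise ValueError(f"no File object registered for {executable!r}") from None
```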
Tests
For the main discussion concerning the testing in turbomoleio you should refer to the Testing section of this developer guide. However, given the particular nature of the unit tests implemented for the log parsing we will provide some more explanations here.
The tests for the Parser object are performed by running all the implemented methods
on a series of TURBOMOLE output files. The generated dictionaries are then compared
with references stored as JSON files in the turbomoleio/testfiles/outputs folder.
A tolerance is allowed, given the small differences that can arise when converting
strings to floats, but in general the numerical values in the references should be exactly
equal to those parsed. Note that for some combinations of files and methods the output
will simply be None.
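A comparison with tolerance between a parsed dictionary and a JSON reference could be written along these lines. The assert_close helper and the tolerance values are assumptions made for this sketch; the actual test suite may compare the data differently.

```python
import json
import math


def assert_close(parsed, reference, rel_tol=1e-9, abs_tol=1e-12, path="root"):
    """Recursively compare parsed data with a reference, with a float tolerance."""
    if isinstance(reference, dict):
        assert isinstance(parsed, dict) and parsed.keys() == reference.keys(), path
        for key in reference:
            assert_close(parsed[key], reference[key], rel_tol, abs_tol, f"{path}.{key}")
    elif isinstance(reference, list):
        assert isinstance(parsed, (list, tuple)) and len(parsed) == len(reference), path
        for i, (p, r) in enumerate(zip(parsed, reference)):
            assert_close(p, r, rel_tol, abs_tol, f"{path}[{i}]")
    elif isinstance(reference, float):
        assert isinstance(parsed, (int, float)), path
        assert math.isclose(parsed, reference, rel_tol=rel_tol, abs_tol=abs_tol), path
    else:
        # Strings, integers, booleans and None must match exactly.
        assert parsed == reference, path


# Invented reference and parsed data; None is a legitimate parsed value.
reference = json.loads('{"energy": -76.3432, "steps": [1, 2], "gradient": null}')
parsed = {"energy": -76.3432 + 1e-13, "steps": [1, 2], "gradient": None}
assert_close(parsed, reference)
```

Floats are compared with math.isclose, everything else exactly, and the path argument helps locate the first mismatch in a nested dictionary.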
If you want to add a new output file to be tested, add it to the testfiles
folder and also to the list in turbomoleio.output.tests.test_parser.files_list.
If instead you are adding one or more new methods to the Parser, remember to add their names
to the list in turbomoleio.output.tests.test_parser.parser_methods.
If you need to generate the reference JSON files again, for example because you have modified
one of the existing methods of the Parser or because you added a new one, you can use the
turbomoleio.output.tests.test_parser.generate_files() helper function. This will
generate a new JSON file and overwrite the old one for all the files and methods that have
been given as input. You should therefore be extremely careful when running it, since the generated
files will become the new reference. If a bug is introduced in the Parser, the reference
files will be generated with the bugged version and the tests will lose part of their usefulness.
A similar approach has been chosen for the testing of the Data and File objects. Since all the Data objects are contained in at least one File object, the tests concerning the parsing are performed only at the level of the File objects; repeating them for the Data objects as well would just be redundant.
The structure of the tests is similar to that of the tests for the Parser. The test output files are
parsed using the from_file method of the corresponding File object. The object is converted
to a dictionary and compared with the reference stored in the corresponding JSON file.
As before, if you want to add a new output file to be parsed, you should add it to the
testfiles folder and to the list in turbomoleio.output.tests.test_files.files_list.
In addition, if you want to add a new type of File object you should either add it to the
turbomoleio.output.files.exec_to_out_obj dictionary or make it available in the
turbomoleio.output.tests.test_files.cls_dict_path() fixture (follow the example
of EscfOnlyOutput there).
A function to generate the reference JSON files, similar to the one described above, is available:
turbomoleio.output.tests.test_files.generate_files(). The same warning about using it
carefully applies here as well.