.. The turbomoleio package, a python interface to Turbomole for preparing inputs, parsing outputs and other related tools. Copyright (C) 2018-2022 BASF SE, Matgenix SRL. This file is part of turbomoleio. Turbomoleio is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. Turbomoleio is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with turbomoleio (see ~turbomoleio/COPYING). If not, see . .. _developer_parse_logs: =================== Output logs parsing =================== Most of the details about how the parsing of log output files is explained in the :ref:`parsing_outputs` section of the user guide. Here we will provide more information about the different objects involved and what are the constraints that should be followed to add the parsing of new quantities or of new types of outputs to the current implementation. Let us first have a more in-depth inspection of the :class:`turbomoleio.output.parser.Parser` object. The ``Parser`` takes the string of a TURBOMOLE log file as an input. It is then used to extract the relevant data information required for each Data or File object. The Parser is made of several parsing methods, each of which is in charge of parsing a specific portion of the log file. Some of the parsing methods are very specific to one type of calculation (e.g. only relevant for an escf output, or for a statpt output), while others are common to several of them (e.g. :meth:`turbomoleio.output.parser.Parser.basis`, :meth:`turbomoleio.output.parser.Parser.header`). Each parsing method returns a dictionary with the data parsed or ``None`` if the section to be parsed could not be found in the string. This dictionary is meant to be used by the Data and File objects during their instantiation. The parsing methods are implemented as *lazy* properties, meaning that they will be generated only once, using the ``lazy_property`` context manager available in the `monty package `_. This is advantageous since some sections contain mixed information of the same property and it may be called more than once while building a File object. The data is thus stored temporarily in the ``Parser`` instance, but this should not be a problem, because the information extracted is relatively small and the parser instance is meant to be disposed at the end of the generation of the object. In general, when possible, the parsing methods first narrow down the section of the text that contains the information that should be extracted and then work on this to extract the exact data needed. This is usually because the outputs of TURBOMOLE do not have an organized structure and this allows to have a target specific lines of the output more easily. The information parsed with the ``Parser`` are meant to be used by the Data objects to create an instance. In particular all the Data and File objects are subclasses of the :class:`turbomoleio.output.data.BaseData` abstract class. This has a single abstract method, ``from_parser``, that should implemented by subclasses and should be able to create a new instance of the Data or File object from a ``Parser`` instance. The ``BaseData`` class provides a ``from_string`` and a ``from_file`` method that rely on ``from_parser`` to create an instance of the subclass from the string of the output or from the file. The Data objects are meant to contain data that can be grouped together by meaning or by type. In the ``from_parser`` method it can use the information coming from a single property of the ``Parser`` or from more than one. On the other hand the File objects in the ``from_parser`` preferably do not access directly the properties in the ``Parser`` but instead create instances of Data object and store mainly Data instances as attributes. From a design point of view, this whole approach has been chosen since it allows to: * share parsing functions for different types of calculations, * quickly instantiate Data objects based on a single portion of the output, e.g. if the user only wants information about the energy, he can obtain it running ``ScfEnergiesData.from_file("ridft.log")`` without the waste of parsing all the rest of the file. * update the parsing of a specific section for new versions of TURBOMOLE without impacting the other parsing methods. This is possible since the parsing methods for the different sections are isolated. .. _developer_parse_new_quantity: Parsing a new quantity ====================== If you want to start parsing a new quantity for one of the types of log files that are already handled you should proceed from the bottom up. First check if the quantity that you want to parse is part of a section that is already parsed by some method of the ``Parser``. If that is the case you can probably modify that method, otherwise you should create a new property, mark it with the ``lazy_property`` context manager and implement the parsing inside it. A good approach is to use regular expressions both to narrow down the section that you want to focus on and to extract the exact values that you want, but for small sections analyzing the lines one by one is also acceptable. .. note:: When writing the parser always consider that the output of TURBOMOLE can change considerably depending on the different options provided (e.g. an entire section or single values might appear/disappear). Always check that your parser is working under different conditions. Once the ``Parser`` has been modified you should either update one (or more) of the Data objects or write a new one, depending on the type of information that you extracted. In the latter case subclass the ``BaseData`` class and get your newly parsed information in the ``from_parser`` method. Finally, if a new Data object has been created, modify the appropriate File objects that should contain it. .. warning:: When adding new attributes to Data and File objects always set a ``None`` as a default in the ``__init__``. In this way data that have been json serialized with previous versions of the code will still be deserialized correctly (only your newly added property will be missing). If instead you change or remove one of the existing attributes the old data will not be deserialized anymore. Backward compatibility changes as these should be though carefully and agreed upon by the community of users. Parsing a new type of log ========================= When parsing the output provided by an executable that is not supported by the current version of turbomoleio you should proceed in a similar manner as when :ref:`developer_parse_new_quantity`. In this case you will probably need to add several new methods to the ``Parser``, addressing the different quantities and information that you want to extract. Also check which ones of the existing properties are working for your kind of output, since some of the are likely to be compatible (e.g. the ``header`` attribute will very likely be equivalent in your case as well). After creating the parser methods, your should encapsulate that information inside Data objects and create a new File object where you will store all the extracted data. The same recommendations given in the previous sections hold here as well. Lastly, if this is suitable, add your object to the :data:`turbomoleio.outputs.files.exec_to_out_obj` dictionary. This will be used as a reference to decide which File object to use when running a specific executable. In particular it will be used in the unit tests, as explained below. Tests ===== For the main discussion concerning the testing in turbomoleio you should refer to the :ref:`developer_testing` section of this developer guide. However, given the particular nature of the unit tests implemented for the log parsing we will provide some more explanations here. The tests for the ``Parser`` object are performed running all the methods implemented on a series a TURBOMOLE output files. The generated dictionaries are then compared with references stored in the ``turbomoleio/testfiles/outputs`` folder as JSON files. A tolerance is allowed, given potentially small differences that can happen while converting strings to floats, but in general the numerical value should be exactly equivalent to those parsed. Note that for some combinations of files and methods the output will simply be ``None``. In case you want to add a new output file to be tested you should add it in the ``testfiles`` folder and also add it to the list in :data:`turbomoleio.output.tests.test_parser.files_list`. If instead you are adding one or more new methods to the ``Parser``, remember to add their name to the list in :data:`turbomoleio.output.tests.test_parser.parser_methods`. In case you need to generate the reference JSON file again, maybe because you have modified one of the existing ``Parser``'s method or because you added a new one, you can use the :func:`turbomoleio.output.tests.test_parser.generate_files` helper function. This will generate a new JSON and **overwrite the old one** for all the files and methods that have been given in input. So you should be extremely careful when running it, since the generated files will become the new reference. If a bug is introduced in the ``Parser``, the reference files will be generated with a bugged version and the tests will partially loose their use. A similar approach has been chosen for the testing of the Data and Files objects. Since all the Data objects are contained in at least one File object, the tests concerning the parsing will be performed only at the level of the File objects, since repeating them for the Data as well would just be redundant. The structure of the tests is similar to those for the ``Parser``. The test output files are parsed using the ``from_file`` method of the corresponding File object. The object is converted to a dictionary and compared with the reference stored in the corresponding JSON file. As before, if you want to add a new output file to be parsed, you should add it in the ``testfiles`` folder and in the list in :data:`turbomoleio.output.tests.test_files.files_list`. In addition if you want to add a new type of File object you should either add it to the :data:`turbomoleio.outputs.files.exec_to_out_obj` or make it available in the :func:`turbomoleio.output.tests.test_files.cls_dict_path` fixture (follow the example in the case of ``EscfOnlyOutput`` there). A function to generate the reference JSON files similar to the one described above is available: :func:`turbomoleio.output.tests.test_files.generate_files`. The same warnings of dealing with it carefully should be kept in mind here as well.