Comment on essd-2021-284

General comment: This study is devoted to the description of a new dataset of microphysical cloud parameters for optically thin clouds, retrieved from infrared spectral radiances measured aboard the RV Polarstern in the Arctic in summer 2017. Cloud optical depths, effective radii of hydrometeors (cloud droplets and ice crystals), as well as liquid and ice water paths, are derived from a mobile Fourier-transform infrared spectrometer. The results are compared to those derived from a well-known synergy of cloud radar, lidar and microwave radiometer measurements (Cloudnet). The study builds on an invaluable dataset of observations sampled during one summer, in a region where such measurements are not common. However, the manuscript often presents the results in a qualitative style without fully investigating the differences between the two datasets. The reader is left without a clear understanding of the significance of the differences in a statistical sense and, whenever differences exist, without a clear explanation of why this new dataset would be more reliable. The major comments described below must be taken into account before publication.

1/ The word « significant » is used many times throughout the paper to mean « large », forgetting the quantitative, scientific meaning of that word in a statistical sense. Which hypothesis is tested to confirm that a difference is really significant? To which null hypothesis does the p-value refer? 2/ The authors never explain which quantity has been calculated when they mention « significant correlations ». Does it refer to the Pearson correlation coefficient? The coefficient of determination R²? Spearman's rank correlation coefficient? In addition, providing « correlations », even if they are large, says nothing about the discrepancies; it only means that the parameters vary together. What are the biases and the root-mean-square errors?
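As a minimal illustration of this point (a Python sketch with made-up numbers, not the authors' data): a perfect correlation can coexist with a large bias, so a correlation coefficient alone says nothing about agreement.

```python
# Hypothetical example: a retrieval that is perfectly correlated with the
# reference but systematically offset by +5 units.
import numpy as np

truth = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # reference values t_i
retrieval = truth + 5.0                        # retrieved values r_i

# Pearson correlation coefficient and coefficient of determination
r = np.corrcoef(retrieval, truth)[0, 1]
r_squared = r**2

# Scores that actually quantify the discrepancy
mean_bias = np.mean(retrieval - truth)
rmse = np.sqrt(np.mean((retrieval - truth) ** 2))

print(r, r_squared, mean_bias, rmse)   # r = 1 although bias and RMSE are 5
```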
3/ There is a confusion about the term « standard deviation », which is used throughout the text (especially in Sect. 5.5) to express the RMSE.
The authors do not use the standard quantitative scores widely used by the scientific community to evaluate the performance of an algorithm. What is called 'Mean' seems to be the 'Mean Bias'. This mean bias can be close to 0 due to compensating errors. The RMSE (root-mean-square error) usually gives complementary information about the performance. But what the authors use here, called « STD (TC) », does not actually represent the full discrepancy between the retrieval and the true parameter as the RMSE would do. What has been calculated in the paper is the STD of the differences between the retrieval (r_i) and the true parameter (t_i), which is: STD = \sqrt{ (1/n) \sum_i (x_i - \bar{x})^2 }, where x_i = r_i - t_i and \bar{x} is the average value of the x_i.
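A short sketch (with hypothetical differences, not values from the paper) of why the STD of the differences understates the discrepancy whenever the mean bias is non-zero, since RMSE² = MB² + STD²:

```python
# Synthetic differences x_i = r_i - t_i between retrieval and truth.
import numpy as np

x = np.array([1.2, 0.8, 1.5, 0.9, 1.1])

mb = np.mean(x)                   # mean bias
std = np.std(x, ddof=0)           # what the paper appears to call "STD (TC)"
rmse = np.sqrt(np.mean(x**2))     # full discrepancy

# With the population STD (ddof=0), the decomposition is exact:
assert np.isclose(rmse**2, mb**2 + std**2)
print(mb, std, rmse)              # STD is small here, yet RMSE is large
```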
How large is the RMSE for each retrieved parameter? 4/ Standard deviations are given with a precision of two digits after the decimal point, for instance in the abstract. Is this level of precision really realistic?
If I understand what has been calculated, the standard deviations are only dispersions. Did the authors also calculate the uncertainties on the retrieved parameters? This is crucial information for a reader interested in using this dataset. 5/ The methodology is justified in a questionable way (e.g. L 41): there are plenty of algorithms based on a similar approach that are freely available. Some of them are actually mentioned later in the paper (MIXCRA, CLARRA, XTRA). Can the authors explain exactly what is new in comparison to other published algorithms?

Specific comments:
L 10-12: It is not clear in the abstract which dataset is the reference and which one is evaluated in the paper. This sentence gives the impression that the authors aim to evaluate the data on optically thin clouds measured by microwave radiometers within the Cloudnet framework (not from the FTIR spectrometer).
L 13: The syntax used here (« allows to perform […], which was the case [...] ») is misleading. The calculations of the cloud radiative effects are not performed in this study.
L 37: « smaller uncertainty »: Based on the scientific literature, how large is it? L 56-60: Only 4 lines do not justify a whole section. Sections 2 and 3 should be combined.
L 102: « accuracy of ≥ ± 5 m ». This is confusing. Does it mean that the absolute error is larger than 5 m? L 105: Do the data from the Vaisala ceilometer and the Cloudnet profiles at least agree for the P106 period? It is important to give the bias here, as the ceilometer data are used during the entire cruise.
Sect. 5 is very long. It gives the impression that the paper focuses on the presentation of an algorithm rather than on the description and evaluation of the EM-FTIR measurements. Can the authors comment on the main objective of this paper? L 116-118: What are the main differences between the various algorithms?
L 123: Are aerosol optical properties included in the calculations, especially for dust particles in the infrared spectrum?
L 138: What about the size distribution of ice crystals? Is it also prescribed? Eq. 7: What does \nu_n mean? I had understood that \nu was the mean wavenumber in each interval. Why should it be a function of n, defined as an iteration step in Eq. 3? Eq. 8: Where do these values come from? Have the authors performed a sensitivity study to evaluate the influence of S_a^{-1} on the final retrieved parameters? L 177-178: It would be better to use \sigma_{ice} everywhere, rather than ext(r_{ice}). The extinction coefficient of ice crystals should also depend on the temperature, as the refractive indices do.
L 205: As a consequence of this convention, the standard deviation of r_{ice} would be written \sigma_{r_{ice}}. To avoid confusion with the extinction coefficient of ice crystals, the authors may want to denote the latter differently, for example \alpha_{ice}(r_{ice}). L 282-283: Please comment on those values. They seem extremely large to me. Do they suggest that the effective radii and liquid/ice water contents cannot be estimated by this approach?
L 286: What do the authors mean by the « standard deviations of r_{ice} »? Is it the standard deviation of the parameter r_{ice} itself, or the standard deviation of a difference, as elsewhere in the paper? L 290-291: This is only a partial conclusion. In the case of hollow columns, for example, the retrieval is particularly bad in almost half of the cases, but this is not mentioned here.
L 294: What are « differentials of IWP »? Are they simply differences?
Tables 5, 6, 7: « Difference of r_{ice}/IWP/\tau_{ice} ». What are the reference parameters? L 301: This is not the place for this; it is stated later in a dedicated section.
L 350: Do the authors conclude that the geometry of the ice crystals was incorrect?
L 354 and following: This is a very strange way to write differences between two datasets. In the literature, when we write « m ± s », it stands for a mean value m and a dispersion value, generally expressed by the standard deviation s. If one would rather express a confidence interval around m, it is usually written m ± s/\sqrt{n}, where n is the number of values in the dataset. When comparing two datasets, it is common to use the mean bias (MB) and the RMSE, but they are never written as MB ± RMSE, as the second one does not stand for a dispersion around the first one. Both are statistical scores expressing the discrepancies between a model distribution and a reference or observed value. In this section and the next ones, the way the values are given is very confusing.
Fig. 10: The values do not seem correlated, and the r parameter is indeed very low. Are the data derived from TCWnet really reliable?
L 358: « means and standard deviations for LWP and r_{liq} are shown ». In the caption of Table 9, the text seems to indicate that the given values are means and standard deviations of differences. Which one is correct?
Sect. 6.1: This short subsection is confusing and not very rigorous. Do the values given here significantly differ (in the statistical sense, i.e. according to a statistical test) from the values obtained for the test cases? L 318: « there a less cases »: How many? Which fraction does this represent?
Tables 8, 9: Do « Mean » and « STD » stand for the mean and standard deviation of the parameters IWP and r_{ice}, or for the standard deviation of the discrepancies between the variables retrieved from TCWnet and Cloudnet? In the latter case, it would be better to use the mean bias and the RMSE.
Tables 8, 9: What exactly has been tested by the p-value (never mentioned in the text)? To which null hypothesis does the statistical test correspond?
What do the authors conclude from such values? L 368: « significant correlation »: the authors may rather want to say that the correlation coefficient is large. The statistical significance can then be discussed using a statistical test (and the associated p-value under a specified null hypothesis).
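To make the point raised at L 354 about « m ± s » concrete, here is a minimal sketch (with hypothetical numbers, not values from the paper) of the difference between the dispersion s and the standard error s/\sqrt{n} used for a confidence interval on the mean:

```python
# Hypothetical differences between two datasets.
import numpy as np

diffs = np.array([2.0, -1.0, 3.0, 0.5, 1.5, -0.5, 2.5, 1.0])
n = diffs.size

m = np.mean(diffs)
s = np.std(diffs, ddof=1)          # sample standard deviation: a dispersion
sem = s / np.sqrt(n)               # standard error of the mean

# "m ± s" describes the spread of the differences;
# an approximate 95 % confidence interval on the mean is m ± 1.96 * sem.
print(f"dispersion: {m:.2f} ± {s:.2f}; mean with 95% CI: {m:.2f} ± {1.96 * sem:.2f}")
```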
L 405-406: The error is as large as the threshold on LWP. Can we say anything about the agreement of the two datasets in this case? L 407-410: No statistical test has been performed or discussed. It is therefore impossible to say anything about the significance.
L 409: « too small », « overestimated »: this is very qualitative. By how much? Are the differences larger than the uncertainties?
L 414-417: The paper underlines that the results on r_{ice}, r_{liq} and IWP differ from those derived by Cloudnet. Is it worth publishing such results if the values differ significantly? Which dataset is reliable?

Technical comments:
The syntax is often incorrect and there are a lot of typos in the current version. The text needs to be checked very carefully, and ideally corrected by a native speaker.
L 51: A closing parenthesis is missing here.     : Replace « divided by the chosen ice particle shape » by « for each ice particle shape ».
L 327: Replace « are the spectral windows » by « is the spectral window ». L 403: Add a « that » at the end of the sentence.