How to import ETER data into Stata

STATA is well suited to working with the ETER database. This section therefore helps users import ETER data into STATA as efficiently as possible. After describing the process, we summarize the corresponding syntax at the end of this post.

How to download the ETER file

As a prerequisite, other than STATA installed on your computer, we suggest downloading the data from the ETER website as follows:

  • First, we recommend registering to obtain your credentials and then logging in to the ETER website. This gives you access to the most detailed version of the database, including the values behind several codes (such as “c” and “s”).
  • Once logged in, you have to select “Search HEI Data” (on the top left of the main webpage).
  • In the search menu (https://www.eter-project.com/#/search), you have different opportunities:
  1. “Export All Variables for All Years and All Countries” (see the yellow button). Using the drop down menu, you can choose “Export All Variables”. A “Display and Export Settings” box will appear. You have to check that “variable name” is selected. Then select your preferred export format on the right (again with the drop down menu) and choose option “Machine Ready (STATA)”. At the end, you have to click on “Export” (in yellow) on your left. A .csv file will be downloaded.
  2. Select a sample of variables. To select them, tick the corresponding labels; you may also select all the variables included in a given section (such as Basic Institutional Descriptors or Geographic information). Then:
    • If you want to download the selected variables for all years and all countries, you have to click again the yellow button “Export All Years and Countries”. Then, using the drop down menu, select “Export Selected Variables”. The display setting box will appear. As in point 1., you have to check that the “variable name” on the left is flagged, before selecting the Export Format on the right (again with the drop down menu), which should be “Machine Ready (STATA)”. At the end, you have to click “Export” (in yellow) on your left. A .csv file will be downloaded.
    • Otherwise, if you want to select the Country/Countries or the year/years you have to click on the green button “Select and continue”. In the following menu, you will choose years (“Select years” on the top left side) and country (“Select countries” on the top right side) then click on “Search HEIs”. The program will return all the data. Now you have to go to “Settings” and select the “Machine Ready (STATA)” checking that the “variable level” is flagged and Apply (yellow button). Finally, click on “Export data”, select from the dropdown menu “Export visible data”, get yourself a coffee and wait for the .csv-file.

How to import the ETER data into STATA

In our recent contributions and analyses, we used the latest version of STATA (15.1); please note that some commands or functions may be missing in earlier versions.

Once you have imported the data, you may have to deal with several additional issues related to non-numerical variables and special codes before you can do calculations. They are summarized as follows.

  • Non-numerical variables: data for calculations must be numerical; if a column is stored in non-numerical (string) format, the whole column cannot be used for computations. You can easily detect such columns because all their data are shown in red (while numerical variables are in black). You can then transform strings into numerical data using the corresponding syntax below.

In ETER, all variables are recognized as strings due to the presence of special codes, like “m” for “missing”. By selecting the STATA export option in the ETER web application, special codes are automatically recoded (“m” becomes “.m”) so that they are recognized as missing values for numerical variables. Hence, the corresponding variables are automatically turned into numeric.

  • Special codes: value fields in the ETER data can contain special value codes. A value in ETER can therefore be either a number or a code that gives you additional information (e.g. “a” means “not applicable”, “m” means “missing”, etc.; see a full list of all value codes here). If you keep the original format, columns containing special codes are recognized as strings instead of numbers, so they cannot be used for calculations (as highlighted in the above paragraph). To correct this, you can run several data preparation commands that recode the cells containing these codes. We suggest running the commands listed in the corresponding section of our script below.
  • Moreover, the special code “a” means “not applicable”, for example for the number of PhD students when the HEI does not award a PhD. For many purposes, codes “a” can be turned into “0”.
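
To illustrate how STATA treats these codes once they are numeric, here is a minimal sketch on toy data (not ETER data; the variable names hei and phdstudents are made up for the example). It shows that extended missing values such as .a and .m sort above every number, which is why conditions should exclude them explicitly.

```stata
* Toy data, not ETER: three institutions with a made-up PhD student count
clear
input str10 hei phdstudents
"A" 120
"B" .a
"C" .m
end

* Extended missing values are larger than any number,
* so this condition also matches the .a and .m rows
count if phdstudents > 100

* Excluding missing values restricts the count to real numbers
count if phdstudents > 100 & !missing(phdstudents)

* The two codes can still be told apart
count if phdstudents == .a
count if phdstudents == .m

* Optional: treat "not applicable" as zero, leave "missing" untouched
recode phdstudents (.a = 0)
list
```

Whether turning “a” into 0 is appropriate depends on the analysis; the recode at the end is only one possible choice.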

Congratulations, you have now imported the ETER data into STATA!

Useful coding snippets

*File import. You can import the whole dataset by selecting “file”, “import”, “Text data (delimited, *.csv, …)”. Once you have selected the file to be imported (with “browse”), opt for “delimiter: custom” and select “;”. Please make sure that UTF-8 is selected in the “text encoding” dropdown menu (this way all characters in the file, for example Arabic or Coptic ones, are read correctly). The corresponding script is the following:

import delimited "YOURDATAPATHFILE NAME.csv", delimiter(";") encoding(utf8)

*file save in STATA version (.dta). In this way you can save a version of the file that can be directly opened (with “file” “open”, it is not necessary to use import) into STATA. The corresponding script is listed below:

save "YOURDATAPATHFILE NAME.dta"

*How to transform all string variables into numerical variables and replace decimal commas. This command is necessary when data are stored as strings rather than numbers (due to the presence of special codes such as “.m”, “.a”, etc.) and therefore cannot be included in your calculations. With “destring”, non-numerical variables are turned into numerical ones, which can then be used in the analysis.

destring, replace dpcomma
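
After running destring, it can be useful to check which variables are still stored as strings. The sketch below is one possible check, assuming the ETER data are already loaded; some variables (for example institution names) are expected to remain strings. The local macro name strvars is just an illustrative choice.

```stata
* Collect the variables that are still strings after destring
ds, has(type string)
local strvars `r(varlist)'

* For each of them, count the non-empty cells that real() cannot
* read as an ordinary number (text, or special codes such as .m)
foreach v of local strvars {
    quietly count if missing(real(`v')) & `v' != ""
    display "`v': " r(N) " cells not readable as a number"
}
```

A variable that should be numeric but shows only a handful of unreadable cells is a good candidate for closer inspection.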

*How to create a dummy variable for the legal status, where 0 includes both the original codes “0” and “2” (public and private government-dependent HEIs) and 1 the private ones. This is useful since ETER includes few government-dependent institutes, and these are in fact very similar to the public ones.

recode legalstatus (2=0) (0=0) (1=1), generate(legalstatus_binary)

*How to generate a Ph.D./non-Ph.D. dummy. This syntax produces a dummy variable that takes the value 0 if the highest degree delivered is ISCED 5, 6 or 7 and the value 1 if the HEI also delivers ISCED 8.

recode highestdegreedelivered (1=0) (2=0) (3=1), generate(phdawarding)
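
A quick way to confirm that both dummies were built as intended is to cross-tabulate each original variable against its recoded version. This is a minimal sketch, assuming the two recode commands above have already been run on the loaded ETER data.

```stata
* Each original code should map to exactly one value of the dummy;
* the missing option also shows how special codes were carried over
tabulate legalstatus legalstatus_binary, missing
tabulate highestdegreedelivered phdawarding, missing
```

Values not covered by a recode rule, including special codes such as .m, are copied unchanged into the new variable, so they should appear on their own rows and columns in these tables.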

*How to fill in missing values for publication and EU-FP data. Missing data mean that these institutions have not been identified in the source databases (Web of Science and EUPRO), so they can safely be set to “0” for the analysis. This script applies only to the RISIS-ETER version of the database (risis-eter.orgreg.joanneum.at).

recode publications meannormalizedcitationscore (missing = 0)

*How to recode student breakdowns when the code is “a”. The code “.a” in the cells that refer to breakdowns of students enrolled and graduates means that the value is not present because the HEI does not deliver that kind of degree. You can choose to turn the corresponding value into zero to avoid dropping these cases from the analysis. The scripts below are written in their general form, so all variables whose names start with “totalgraduates”, “totalstudents”, “studentsenrolled” or “graduates” will be affected by the recoding.

recode totalgraduates* (.a=0)
recode totalstudents* (.a=0)
recode studentsenrolled* (.a=0)
recode graduates* (.a=0)
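
Because this recoding cannot be undone once the file is saved, it may be worth counting the affected cells first. The loop below is a sketch to run before the recode commands, assuming the ETER data are loaded and the variable name prefixes are the ones used above.

```stata
* Count the cells coded .a in each breakdown variable
foreach v of varlist studentsenrolled* graduates* {
    * Skip variables that are still strings
    capture confirm numeric variable `v'
    if _rc continue
    quietly count if `v' == .a
    display "`v': " r(N) " cells coded .a"
}
```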

*How to reduce missing values in the Academic Staff. The following script fills in missing FTE values for the Academic Staff using a linear regression of FTE on HC.

regress totalacademicstafffte c.totalacademicstaffhc#i.phdawarding, noconstant
predict linear
generate newacademicstaFTE=totalacademicstafffte
replace newacademicstaFTE= linear if newacademicstaFTE==.m
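
To keep track of which values come from the regression rather than from the original data, one option is to add a flag variable. This is a sketch under the assumption that the commands above have been run; the name fte_imputed is hypothetical.

```stata
* 1 if the original FTE was coded .m and a prediction was available
generate byte fte_imputed = (totalacademicstafffte == .m) & !missing(linear)
tabulate fte_imputed

* Cases that could not be filled: predict returns missing wherever
* headcount or the PhD dummy is itself missing
count if totalacademicstafffte == .m & missing(linear)
```

The flag makes it possible to rerun an analysis with and without the imputed values and compare the results.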

You are now ready to use the full variety of the ETER dataset for your research. If you have questions about this post, please contact us at eter@eter-project.com. If you have additional questions on the ETER project, technical or not, do not hesitate to contact us.