ANC Workshop - Taha Ceritli

Speaker: Taha Ceritli

Title: ptype-cat: Inferring the Type and Values of Categorical Variables

Abstract: Type inference is the task of identifying the type of values in a data column and has been studied extensively in the literature. Most existing type inference methods, including ptype (a probabilistic type inference method, Ceritli et al. 2020), support data types such as Boolean, date, float, integer and string. However, these methods do not consider non-Boolean categorical variables, where there are more than two possible values encoded by integers or strings. Therefore, such columns are annotated as either integer or string rather than categorical, and the user needs to transform them into categorical manually. In this work, we propose ptype-cat, which can identify the general categorical data type (including non-Boolean variables). Additionally, we identify the possible values of each categorical variable by adapting ptype, which is robust against missing data and anomalies. Combining these methods, we provide enhanced type inference capability for Pandas DataFrames and automatic documentation of data dictionaries in the well-known Attribute-Relation File Format. Our experiments show that our method achieves better results than existing applicable solutions.

Tuesday, 16th March 2021, 11.00 - 12.00, online
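
For readers unfamiliar with the manual step the abstract refers to, the following is a minimal sketch of how a user would convert columns to the categorical type by hand in Pandas. It uses only standard Pandas calls and is not the ptype-cat API; the column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical example data (not from the talk): two non-Boolean
# categorical variables, one encoded by strings and one by integers.
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small", "large"],
    "rating": [1, 3, 2, 3, 1],
})

# Default inference treats these as string-like and integer columns,
# not as categorical ones.
print(df.dtypes)

# The manual step: the user has to decide which columns are categorical
# and convert them explicitly.
df["size"] = df["size"].astype("category")
df["rating"] = df["rating"].astype("category")

# After conversion, the set of possible values is available per column.
size_values = list(df["size"].cat.categories)
rating_values = list(df["rating"].cat.categories)
print(size_values)
print(rating_values)
```

The talk's proposal, as described in the abstract, is to infer both the categorical type and its possible values automatically, so that a step like the explicit `astype("category")` above is no longer left to the user.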