AI-Powered Industry and Occupation Coding in General Household Survey

The General Household Survey (GHS) of the Census and Statistics Department (C&SD) collects information on the demographic and socio-economic characteristics of the population of Hong Kong, including the industry and occupation of respondents. Traditionally, the text descriptions of industries and occupations given in the questionnaires are reviewed manually by coders, who categorise each description into a class according to the prescribed industry and occupation coding standards. This manual approach is time-consuming and labour-intensive.

Some examples of industry coding are as follows:

Questionnaire response (industry description) → Industry code
Chain supermarket → 470 Retail trade
Foreign currency exchange service → 649 Other financial service activities
Government secondary school → 852 Secondary schools

Some examples of occupation coding are as follows:

Questionnaire response (job title | main tasks or duties | qualifications required) → Occupation code
Secondary school teacher | Responsible for mathematics curriculum design and teaching | University degree → 23 Teaching professionals
Heavy truck driver | Long-distance freight transportation | Heavy goods vehicle driving license → 83 Drivers and mobile machine operators

With the breakthrough development of Large Language Models (LLMs) in Artificial Intelligence (AI), the C&SD is actively studying the use of AI models to perform coding tasks. Such models would complement, or even partially replace, the traditional manual coding process, with the aim of increasing both coding efficiency and accuracy. A preliminary study has shown that the performance of the AI models is satisfactory.
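To make the coding task concrete, the following is a minimal sketch of how a questionnaire response might be turned into a prompt for a language model and how the reply might be checked against the coding standard. It is an illustration only, not the C&SD's actual method: the function names `build_prompt` and `parse_code`, the prompt wording and the two-entry code list are all hypothetical, and no particular model or API is assumed.

```python
# Hypothetical excerpt of an occupation coding standard (illustration only).
ALLOWED_OCCUPATION_CODES = {
    "23": "Teaching professionals",
    "83": "Drivers and mobile machine operators",
}


def build_prompt(job_title, duties, qualifications):
    """Assemble a plain-text prompt from the three questionnaire fields."""
    options = "\n".join(
        f"{code}: {label}"
        for code, label in sorted(ALLOWED_OCCUPATION_CODES.items())
    )
    return (
        "Assign one occupation code to the response below.\n"
        f"Job title: {job_title}\n"
        f"Main tasks or duties: {duties}\n"
        f"Qualifications required: {qualifications}\n"
        "Allowed codes:\n"
        f"{options}\n"
        "Answer with the code only."
    )


def parse_code(model_reply):
    """Return the code if the reply starts with an allowed code, else None."""
    words = model_reply.strip().split()
    if not words:
        return None
    token = words[0].rstrip(".:,")
    return token if token in ALLOWED_OCCUPATION_CODES else None
```

Restricting the accepted replies to codes that exist in the standard means that a malformed or unexpected reply returns `None` and can be passed to a coder instead of being stored as a code.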

The C&SD is studying a risk-based mechanism for automatic coding, which divides all cases into two categories: high risk (i.e. cases with lower expected accuracy of automatic coding) and low risk (i.e. cases with higher expected accuracy of automatic coding). High-risk cases will still be coded manually by coders and reviewed by supervisors through sample checking. For low-risk cases, the model-predicted codes will be accepted, but a sample of these cases will still be reviewed manually to monitor model performance and to identify emerging occupations and industries that were not covered in model training. Machine coding and manual coding thus complement each other: besides safeguarding coding accuracy, this arrangement allows coders to continue accumulating relevant experience and professional knowledge.
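The routing logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the C&SD's implementation: it assumes that risk is judged from a model confidence score, and the names `Prediction` and `route_cases`, the threshold of 0.9 and the audit rate of 10% are hypothetical choices made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Prediction:
    case_id: str
    code: str          # model-predicted industry or occupation code
    confidence: float  # hypothetical model confidence score in [0, 1]


def route_cases(predictions, threshold=0.9, audit_rate=0.1, seed=0):
    """Split cases into manual coding, auto-accepted codes and an audit sample."""
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    manual, accepted, audit = [], [], []
    for p in predictions:
        if p.confidence < threshold:
            manual.append(p)      # high risk: coded manually by coders
        else:
            accepted.append(p)    # low risk: model-predicted code accepted
            if rng.random() < audit_rate:
                audit.append(p)   # also sampled for manual review
    return manual, accepted, audit


# Example with made-up confidence scores.
cases = [
    Prediction("A1", "470", 0.97),
    Prediction("A2", "649", 0.55),
    Prediction("A3", "852", 0.93),
]
manual, accepted, audit = route_cases(cases)
```

In this sketch the audit sample is drawn only from the accepted cases, which mirrors the idea that low-risk results are still spot-checked to monitor the model. In practice the threshold and audit rate would need to be tuned against manually coded data.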

In the long run, AI is expected to increase coding efficiency, save labour resources and improve coding quality. Looking ahead, the C&SD will continue to fine-tune and retrain the models so that they can adapt to revisions of the coding standards and to the emergence of new industries and occupations, and will further enhance the automatic coding process of the GHS.