AI-Powered Industry and Occupation Coding in General Household Survey

The General Household Survey (GHS) of the Census and Statistics Department (C&SD) collects information on the demographic and socio-economic characteristics of the population of Hong Kong, including the industry and occupation of the respondents. The traditional approach for processing the text descriptions of industries and occupations in the questionnaires involves manual review by coders. Coders have to categorise the text descriptions into different classes according to the prescribed industry and occupation coding standards. This manual coding approach, though offering a very high level of accuracy, is relatively time-consuming and expensive in terms of human resources.

Some examples of industry coding are as follows:

Questionnaire response (industry description) → Industry code
Chain supermarket → 470 Retail trade
Foreign currency exchange service → 649 Other financial service activities
Government secondary school → 852 Secondary schools

Some examples of occupation coding are as follows:

Questionnaire response (job title; main tasks or duties; qualifications required) → Occupation code
Secondary school teacher; Responsible for mathematics curriculum design and teaching; University degree → 23 Teaching professionals
Heavy truck driver; Long-distance freight transportation; Heavy goods vehicle driving license → 83 Drivers and mobile machine operators

With the rapid development of Artificial Intelligence (AI), especially breakthroughs in Large Language Models (LLMs), the C&SD is actively adopting AI to perform coding tasks, complementing and potentially partially replacing the traditional manual coding process. The initial results are encouraging: the AI model has performed satisfactorily in terms of accuracy, processing speed and consistency.

The C&SD adopts a risk-based mechanism for automatic coding, which divides all cases into high-risk and low-risk categories according to the expected accuracy of automatic coding. High-risk cases, which have lower expected accuracy, are still coded manually by coders and reviewed by supervisors through sample checking. For low-risk cases, which have higher expected accuracy, the model-predicted codes are accepted, but a sample of these cases is still reviewed manually to monitor model performance and to identify emerging occupations and industries that were not covered in model training. Taking the survey round of March 2026 as an example, more than half of the coding is performed automatically, significantly enhancing coding efficiency. Machine coding and manual coding thus complement each other: besides ensuring coding accuracy, this arrangement allows coders to continue accumulating relevant experience and professional knowledge.
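
The risk-based routing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the C&SD's actual implementation: the Prediction record, the route_cases function, the use of a model confidence score as a stand-in for "expected accuracy", and the threshold and review_rate values are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Prediction:
    case_id: str
    code: str          # model-predicted industry or occupation code
    confidence: float  # hypothetical proxy for expected accuracy, in [0, 1]


def route_cases(predictions, threshold=0.9, review_rate=0.1, seed=0):
    """Split cases into manual coding, auto-accepted, and a review sample.

    threshold and review_rate are illustrative values, not published figures.
    """
    rng = random.Random(seed)  # seeded so the review sample is reproducible
    manual, accepted, review_sample = [], [], []
    for p in predictions:
        if p.confidence < threshold:
            # High risk: lower expected accuracy, so send to manual coding.
            manual.append(p)
        else:
            # Low risk: accept the predicted code...
            accepted.append(p)
            # ...but still sample some cases for manual review.
            if rng.random() < review_rate:
                review_sample.append(p)
    return manual, accepted, review_sample


if __name__ == "__main__":
    cases = [
        Prediction("case-1", "470", 0.97),
        Prediction("case-2", "649", 0.55),
        Prediction("case-3", "852", 0.92),
    ]
    manual, accepted, review_sample = route_cases(cases)
    print("manual coding:", [p.case_id for p in manual])
    print("auto-accepted:", [p.case_id for p in accepted])
    print("sampled for review:", [p.case_id for p in review_sample])
```

In a design of this kind, raising the threshold sends more cases to manual coding, while the review sample of accepted cases is what makes it possible to detect a drop in model performance or the appearance of industries and occupations the model has not seen.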

In the long run, AI is expected to increase coding efficiency, save labour resources and improve coding quality. Looking ahead, the C&SD will continue to fine-tune and retrain the models so that they can adapt to revisions of the coding standards and to the emergence of new industries and occupations, and will further enhance the automatic coding process of the GHS.