AI in processing trade documents – Anomaly Detection

In the era of big data, the Census and Statistics Department (C&SD) has leveraged artificial intelligence (AI) and data science to enhance operational efficiency and improve the quality of its statistical services. Processing around 70,000 import/export declarations daily for compiling external merchandise trade statistics, C&SD faces the complex task of verifying millions of commodity classifications and declared unit values, traditionally relying on extensive manual review of free-text descriptions.

Since 2018, C&SD has pioneered the use of AI models to analyse unstructured data. Drawing on internal resources and expertise, C&SD has developed an automated system for commodity coding and unit value anomaly detection, built on deep learning algorithms trained on millions of labelled commodity descriptions. This approach has significantly reduced manual checks and improved data quality. During the pandemic, the AI models kept routine operations running smoothly despite severe manpower constraints, especially while special work arrangements for government employees were in force and part of the staff worked from home as far as possible.
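To give a rough sense of what "unit value anomaly detection" involves, the sketch below flags declarations whose unit value (declared value divided by quantity) is far from the typical unit value for the same commodity code. It is only an illustration and not C&SD's system: the department's models use deep learning, whereas this example substitutes a simple robust statistic (a modified z-score based on the median and the median absolute deviation). The function name, the record layout, the commodity codes and the threshold of 3.5 are all assumptions made for this example.

```python
import math
from collections import defaultdict
from statistics import median


def flag_unit_value_anomalies(records, threshold=3.5):
    """Return sorted indices of records with a suspicious unit value.

    records: list of (commodity_code, declared_value, quantity) tuples.
    This layout is hypothetical and chosen only for illustration.
    """
    flagged = []
    groups = defaultdict(list)

    for idx, (code, value, quantity) in enumerate(records):
        if value <= 0 or quantity <= 0:
            # A non-positive value or quantity cannot give a valid unit value.
            flagged.append(idx)
            continue
        # Work on the log scale so that "10 times too high" and
        # "10 times too low" are treated symmetrically.
        groups[code].append((idx, math.log(value / quantity)))

    for items in groups.values():
        logs = [log_uv for _, log_uv in items]
        med = median(logs)
        mad = median([abs(log_uv - med) for log_uv in logs])
        if mad == 0:
            # No spread within this commodity code, so nothing to compare against.
            continue
        for idx, log_uv in items:
            # Modified z-score; 0.6745 rescales the MAD to be comparable
            # with a standard deviation under a normal distribution.
            score = 0.6745 * (log_uv - med) / mad
            if abs(score) > threshold:
                flagged.append(idx)

    return sorted(flagged)


# Illustrative data only: commodity codes "A" and "B" are made up.
sample = [
    ("A", 100, 10), ("A", 110, 10), ("A", 95, 10), ("A", 105, 10),
    ("A", 10000, 10),  # unit value 1000 against a typical value of about 10
    ("B", 50, 5), ("B", 52, 5), ("B", 49, 5),
]
print(flag_unit_value_anomalies(sample))
```

In practice, records flagged in this way would be passed to staff for manual review rather than corrected automatically.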

By early 2024, C&SD had incorporated this technology into its workflows. To date, this has reduced the manpower required for manual checking procedures by over half. This efficiency gain enabled the reallocation of resources to establish two strategic branches: the Data Science Branch (1) and the Social Data Development Branch. This positions C&SD to strengthen its capabilities in analysing big data and capitalising on digital transformation opportunities, with a view to delivering more sophisticated statistical analyses and higher-quality statistical services across various domains.

Please watch the video (Cantonese only) below for details.

Male Statistician:

The Census and Statistics Department (C&SD) handles over a million trade declarations every month to compile the "External Merchandise Trade Statistics".

In the past, computer systems struggled to process textual data, and we were only able to manually validate a small portion of the trade declarations. This was really a significant challenge!

In recent years, we have developed two AI models using deep learning techniques to simulate the human brain's ability to recognise and analyse the information on trade declarations.

We trained these AI models using millions of records with commodity descriptions, enabling them to automatically validate the textual description on every new trade declaration, verify the commodity codes, and calculate whether the values and quantities of the commodities are reasonable.

By early 2024, we had fully implemented these AI models to process trade declarations, achieving promising results! The AI models can now validate approximately three million trade declaration records in just two and a half hours, significantly increasing the number of trade declarations validated and improving the quality of statistics, while reducing the manpower involved by over 40%.

With the resources saved, we have established two new branches, the Data Science Branch and the Social Data Development Branch, and expanded our big data team to focus on promoting big data applications and providing related training.

Female Statistician:

Population Censuses, conducted every ten years, traditionally required the participation of the entire population, with 10% completing a long questionnaire and the remaining 90% answering a short one. Population By-censuses, conducted between two full censuses, require only 10% of the population to complete the long questionnaire. The short questionnaire asks for basic demographic information such as age, sex, and whether the respondent is a permanent resident. These data are used to calculate Hong Kong's population base.

After analysing the 2021 Census data, the department discovered that administrative records from the Immigration Department, such as birth, death, and movement records, could already accurately reflect the demographic structure, and thus fulfil the same purpose as the short questionnaire.

Starting from 2026, we will conduct a population census every five years, on a scale similar to that of a population by-census, where only 10% of the population will be selected to answer the long questionnaire. By also using administrative data to calculate the population base, we can achieve results as accurate as those of a full census.

Additionally, we are actively utilising administrative data from other departments with a view to trimming down the long questionnaire. For example, data on floor area can be provided by the Housing Department and the Rating and Valuation Department, and data on welfare subsidies can be obtained from the Social Welfare Department. By matching these data with our census records, we can reduce the number of questions respondents need to answer, saving both costs and respondents’ time!

The department estimates that by leveraging administrative data and re-organising workflows, we can save 40% of the costs for the 2026 and 2031 censuses combined, amounting to approximately 680 million dollars!

Both Statisticians:

In the future, the C&SD will continue to explore the applications of new technologies, streamline workflows, and optimise manpower to provide higher-quality statistical services to the government and the public!