Chinese bank deregulation: from data extraction to data analysis

  1. Herfindahl-Hirschman index
  2. Concentration ratio for the ‘Big Four’ banks
  3. Download the data
  4. Get the address and city name
  5. Find the branches located in the city
  6. Find the city when the branch is located in a lower administrative district
  7. Find city ID
  8. Extract bank type (SOB, CCB, Rural, etc.)
  9. Retrieve the bank type using the name or the type

China’s financial market, and its banking industry in particular, has gone through deep changes.

In this article, I’ll describe China’s deregulation of the banking system. The first part details the banking industry in China and the deregulation metrics; the second part details the methodology used to fetch the data and construct the dataset.

This article has two motivations. First, I needed to include deregulation in one of my most recent papers, about financial development and pollution emissions in China. Second, the data is available online, but getting it was not an easy task: I had to use AWS, the Google Maps API, and Geopy just to download it. I therefore thought it would be interesting to detail the download strategy, which leverages AWS cloud computing (API Gateway + Lambda + EventBridge), and to make the data and code public.

If you are only interested in the method to get the data, skip the first part and jump directly to Part 2.

If you read this article, you will learn:

  • some history of the banking system in China
  • how to download data from the internet without restriction (leveraging AWS Lambda and API Gateway)
  • how to use fuzzy string matching
  • how to use transformers (sentence embedding) to detect locations

Part 1

1. China’s banking system and the development of the capital allocation system

China has the largest bank loan market globally (Gao, 2019); nonetheless, it is relatively inefficient and fragmented. The literature studying the Chinese banking sector is abundant. It leaves no ambiguity about the inefficiency of credit allocation, which stems from the regional segmentation of the market (i.e., resources are not mobile across provinces) and from a bias in favor of state-owned enterprises (Boyreau-Debray, 2005; Jarreau, 2014). There is also evidence that private firms are discriminated against in the credit market, especially by state-owned banks. The soft budget constraint is ubiquitous in China: state-owned firms can easily raise large amounts of money, regardless of their profitability or risk of default.

In China, bank loans are the most common financing source, accounting for 75% of the total credit supply (Li, 2018). Of that, three quarters are issued by the Big Four (Industrial and Commercial Bank of China, China Construction Bank Corporation, Agricultural Bank of China, and Bank of China), which were under the direct control of the government until 2006. Joint-venture equity banks and city banks account for the remaining quarter. Ferri (2009) finds that non-state-owned banks perform better and hold far fewer non-performing loans (NPL) on their books.

Recognizing the apparent failure of the four state-owned commercial banks (SOCB) to allocate capital efficiently, the Chinese government implemented a substantial restructuring of the banking sector. It started in 1994, when the government allowed more players to enter, primarily city commercial banks owned by local governments, firms, and private shareholders, and continued in 1996 with the arrival of foreign banks. The market penetration of those banks was gradual, as shown in figure 1. The market share of non-SOCB banks was at its lowest in 1999 and has kept growing since then.

The development of the banking system in China can be divided into three stages:

  • The first stage is from the founding of the People’s Republic of China in 1949 to the implementation of the reform and open-door policy in 1978.
  • The second stage is from 1978 to 2001, the year China entered the WTO.
  • The third stage is from 2001 to now.

China’s banking system originated in 1949. Because the system was not yet fully developed, the People’s Bank of China (PBC) had to act as both the central bank and a commercial bank from 1949 to 1978. Before the reform and open-door policy in 1978, China operated under a centralized system in which all industries were firmly controlled by the government.

After the economic opening policies of 1978, the banking system was reformed continually. In 1983, the ‘Big Four’ banks (i.e., BOC, ABC, CCB, and ICBC) took over the commercial banking business from the PBC.

From 1987 to 2005, twelve shareholding commercial banks were established in succession. These shareholding commercial banks dramatically increased competition among banks (Chemmanur et al., 2019).

The China Banking Regulatory Commission (CBRC) was established in 2003 to supervise and administer banks and to maintain the legal and steady operation of the banking industry.

The ‘Big Four’ were transformed from fully state-owned to shareholding banks through recapitalization and IPOs.

In 2003, the CBRC released reforms to ease restrictions on the banking system, such as allowing foreign banks to operate in China, allowing shareholding commercial banks to set up branches in some counties, and removing entry restrictions on opening new branches in a city where a shareholding commercial bank had already set up branches.

The successive waves of new types of financial institutions undeniably increased competition and spurred efficiency, especially in the allocation of credit. Not surprisingly, the NPL rate began to decrease by the end of 2002 (as a response to the decrease in the market share of the SOCB). The liberalization and reform of the banking system (transforming urban credit cooperatives into commercial banks, allowing foreign banks, limiting the state-centered management system, etc.) led to an improvement of the financial landscape in China, with better profitability and less risk.

2. The development of city commercial banks

A variety of new bank types started to appear in the Chinese financial system in the mid-1980s, including urban and rural credit cooperatives and trust and investment companies. Their role nevertheless remained minor for most of the 1990s.

At the launch of the 1994 reform, the four state-owned banks held 80% of the total deposits and loans in the banking system. Two major changes occurred:

  • First, urban credit cooperatives (UCCs) developed and later became city commercial banks (CCBs).
  • Second, new banks were created, including the current dozen national and regional joint-stock commercial banks, eight of which have foreign investors.

The UCCs were the most dynamic of the new financial institutions emerging in the mid-1980s. Their comparative advantage in using local information, and in monitoring and enforcing sanctions on borrowers, allowed them to overcome the traditional information asymmetry better than national state-owned banks. They were also subject to less regulation and were thus able to respond effectively to the growing demand for investment loans by enterprises in both the state and non-state sectors.

Starting in 1995, the UCCs were restructured into urban cooperative banks, which were renamed CCBs in 1998. The capital structure of the future CCBs was set up so that local governments would play a significant role, while also including shares held by urban enterprises and residents.

The CCBs differ from the state-owned banks in one important dimension: they have many shareholders. Although some of these shareholders may themselves be in the public sector, belonging to either the public administration or the SOE system, the plurality of shareholders encourages better corporate governance and performance, as it significantly reduces political interference in bank business.

The growth of CCBs reflects the government’s efforts to liberalize and reform the banking sector. At first, the business of city commercial banks was confined to the urban districts of their home city. From 2006 onwards, some CCBs that met certain size and experience conditions were allowed to open branches in other cities in their home province and even in cities in other provinces. In 2007, CCBs were allowed to expand their operations to non-urban areas, entering into head-on competition with traditional financial actors.

These successive reforms have reduced the geographical segmentation of the banking market, which was one of the main restrictions on the ability of the CCBs to compete effectively with public commercial banks. They also prompted a series of mergers and acquisitions. Starting in 2005, a number of CCBs merged to create larger entities. This restructuring continued as the government encouraged qualified domestic and foreign strategic investment in the CCBs and even allowed some of them to make an initial public offering on the Hong Kong Stock Exchange.

CCBs started in a small number of mostly provincial capital cities. At the outset they operated in only 27 cities, a figure that rose to 47 in 1997, 71 in 2002, 99 in 2007, 144 in 2012, and around 178 in 2018.

Number of CCB

3. Deregulation and concentration

For my recent paper, I needed to go a little further and calculate two metrics that capture deregulation. Following the literature, I construct the following scores as proxies for bank competition:

  • 1 minus the Herfindahl-Hirschman index
  • 1 minus the concentration ratio for the ‘Big Four’ banks

Herfindahl-Hirschman index

The Herfindahl–Hirschman Index (HHI) is a proxy for the level of bank competition. This measure is widely used to describe competition in the banking system in previous literature (Alegria & Schaeck, 2008; Mercieca et al., 2009).

The HHI of the local banking market in city c is calculated from the number of branches of every bank k locally present in each year (with N the total number of banks in the country). The deregulation metric is 1 minus HHI. The figure below indicates that China made significant progress in deregulating the banking system over time.
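Written out under the description above, the index is a sum of squared branch shares. The notation is an assumption introduced here for illustration: $b_{k,c,t}$ stands for the number of branches of bank $k$ in city $c$ in year $t$.

```latex
\[
HHI_{c,t} = \sum_{k=1}^{N} \left( \frac{b_{k,c,t}}{\sum_{j=1}^{N} b_{j,c,t}} \right)^{2},
\qquad
\text{Deregulation}^{HHI}_{c,t} = 1 - HHI_{c,t}
\]
```

A bank with no branch in city $c$ has a share of zero, so summing over all $N$ banks or only over those locally present gives the same value.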

Concentration ratio for the ‘Big Four’ banks

China’s banking system is dominated by large and inefficient state-owned banks, especially those colloquially known as the ‘Big Four’ (Allen et al., 2008): the Bank of China (BOC), Agricultural Bank of China (ABC), China Construction Bank (CCB), and Industrial and Commercial Bank of China (ICBC).

The ‘Big Four’ are by far the primary external financing source in China, although their loan concentration declines significantly throughout our entire sample period (see figure 1). Therefore, we also calculate the concentration ratio for the ‘Big Four’ banks to measure bank competition using the following:
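One way to write this ratio, assuming it is computed from branch counts in the same way as the HHI (the symbols $CR4_{c,t}$ and $b_{k,c,t}$, the number of branches of bank $k$ in city $c$ in year $t$, are notation assumed here for illustration):

```latex
\[
CR4_{c,t} = \frac{\sum_{k \in \{\mathrm{BOC},\,\mathrm{ABC},\,\mathrm{CCB},\,\mathrm{ICBC}\}} b_{k,c,t}}{\sum_{j=1}^{N} b_{j,c,t}},
\qquad
\text{Deregulation}^{CR4}_{c,t} = 1 - CR4_{c,t}
\]
```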

with Bank of China (BOC), Agricultural Bank of China (ABC), China Construction Bank (CCB), and Industrial and Commercial Bank of China (ICBC).

The graph below indicates a significant decrease in the market power of the Big Four over time.
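To make the two metrics concrete, here is a minimal sketch that computes both for a single city-year from a list of branches. It assumes branch counts are the unit of measurement; the function name `competition_metrics` and the toy data are hypothetical and not taken from the paper's code.

```python
from collections import Counter

# Labels for the Big Four, as named in the text.
BIG_FOUR = {"BOC", "ABC", "CCB", "ICBC"}


def competition_metrics(branch_banks):
    """Return (1 - HHI, 1 - Big Four ratio) for one city-year.

    branch_banks holds one bank label per branch,
    e.g. ["BOC", "ICBC", "CityBank"].
    """
    counts = Counter(branch_banks)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one branch is required")
    # HHI: sum of squared branch shares.
    hhi = sum((n / total) ** 2 for n in counts.values())
    # Concentration ratio: share of branches held by the Big Four.
    cr4 = sum(n for bank, n in counts.items() if bank in BIG_FOUR) / total
    return 1 - hhi, 1 - cr4


# Toy city-year: shares are 0.4, 0.4 and 0.2.
branches = ["BOC"] * 4 + ["ICBC"] * 4 + ["CityBank"] * 2
one_minus_hhi, one_minus_cr4 = competition_metrics(branches)
```

Both scores increase as the market becomes less concentrated, which is why they can serve as deregulation proxies.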

Part 2

Strategy to construct the dataset

The second part details the methodology used to construct the dataset. It wasn’t an easy task, and I thought it relevant to share it.

The strategy works as follows:

  1. Download the data from
  2. Get the address and city name
  3. Find city ID -> China encodes the city using a specific ID, which we use in the economic literature on China
  4. Extract bank type (SOB, CCB, Rural, etc)

1. Download the data

The data is made available by the China Banking and Insurance Regulatory Commission at this address:

The website is fairly simple to scrape with requests. If we look carefully at the request, we can spot the endpoint:

The dynamic part of the URL is kCEEKX, which changes every day. The request is a POST whose payload has two arguments:

  • start: index to begin with
  • limit: maximum of 10. This argument is actually useless

If we move the start cursor in increments of 10, we can get all the data.

However, an issue arises because of the message “您的操作过于频繁,请五分钟后再试。” (“You are operating too frequently, please try again in five minutes.”). It means we need to wait 5 minutes before getting new data. A naive approach would be to set a timer and wait, but we can only fetch 30 pages at a time. There are 22726 pages in total, hence it would take at least 63 hours (22726 * 5 / 30 minutes).
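The estimate can be checked with a few lines of arithmetic. This is only a sketch of the calculation in the text; the constant and function names are made up for illustration.

```python
TOTAL_PAGES = 22726      # pages listed on the website
PAGES_PER_WINDOW = 30    # pages we can fetch before being blocked
WAIT_MINUTES = 5         # waiting time imposed by the server


def naive_hours(total_pages=TOTAL_PAGES,
                pages_per_window=PAGES_PER_WINDOW,
                wait_minutes=WAIT_MINUTES):
    """Hours spent waiting if we sleep 5 minutes after every 30 pages."""
    return total_pages / pages_per_window * wait_minutes / 60


hours = naive_hours()
print(f"{hours:.1f} hours")  # prints "63.1 hours"
```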

This is too long! Here comes the trick -> rotate the IP!

But wait, rotating the IP won’t make it much faster; it simply bypasses the error message. The second trick is to leverage AWS to make many more requests per unit of time using Lambda functions.

Lambda & API Gateway

We know there are 22726 pages to fetch, i.e. 227257 addresses at 10 addresses per page.

Here is how it works:

  • Create many JSON files with the values to send in the payload
  • Save it to S3, so that it can trigger a Lambda function
  • The lambda function downloads the data and saves it back to S3

Using this technique, each Lambda function sends multiple requests, and since we upload many JSON files to S3, we can “parallelize” the calls. To make sure all the pages are downloaded, we loop until S3 holds all the pages.

  1. The payload

The first step consists of creating all the possible values the payload can take:

`nb_calls = [int(i) for i in list(np.arange(0, 227270, 10))]`

For instance, nb_calls[:3] yields 0, 10, 20 -> that’s three payloads, which will download 30 addresses.

We then split the payloads into chunks -> one Lambda function will make `n` requests. Here, we want each Lambda function to make 100 requests, so there are about 228 chunks in total. Note that we add a 15-second timer between chunks. We play nice with the server ;)

The JSON file looks like this.
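A minimal sketch of the chunking step, using only the standard library. The JSON layout (`chunk_id`, `limit`, `starts`) is a hypothetical one chosen for illustration, since the original file is not reproduced here.

```python
import json

PAGE_SIZE = 10              # the website returns 10 addresses per request
UPPER_BOUND = 227270        # same upper bound as in nb_calls
REQUESTS_PER_LAMBDA = 100   # requests handled by one Lambda invocation

# Every value the "start" argument of the payload can take.
nb_calls = list(range(0, UPPER_BOUND, PAGE_SIZE))


def make_chunks(values, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def chunk_to_json(chunk_id, starts):
    """Serialize one chunk; the field names are assumptions."""
    return json.dumps({"chunk_id": chunk_id, "limit": PAGE_SIZE, "starts": starts})


chunks = make_chunks(nb_calls, REQUESTS_PER_LAMBDA)
print(len(nb_calls), len(chunks))  # prints "22727 228"
```

The last chunk is shorter than the others, which is harmless: the Lambda function simply loops over whatever `starts` it receives.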

2. Save it to S3, so that it can trigger a Lambda function

The JSON is then saved to S3, which will trigger the lambda function.
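The upload could look like the sketch below. It assumes a boto3 S3 client (whose `put_object` method takes `Bucket`, `Key` and `Body`) is passed in by the caller; the bucket name, key prefix and JSON field names are hypothetical, and the S3 trigger on the Lambda function is configured separately in AWS.

```python
import json
import time


def upload_payloads(s3_client, bucket, chunks, prefix="payloads/", pause_seconds=15.0):
    """Upload one JSON file per chunk and return the S3 keys written.

    s3_client is expected to behave like boto3.client("s3").
    A pause between uploads spaces out the Lambda invocations.
    """
    keys = []
    for chunk_id, starts in enumerate(chunks):
        if chunk_id > 0 and pause_seconds > 0:
            time.sleep(pause_seconds)  # play nice with the server
        key = f"{prefix}chunk_{chunk_id:04d}.json"
        body = json.dumps({"chunk_id": chunk_id, "starts": starts}).encode("utf-8")
        s3_client.put_object(Bucket=bucket, Key=key, Body=body)
        keys.append(key)
    return keys


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   upload_payloads(boto3.client("s3"), "my-bucket", chunks)
```

Passing the client as an argument keeps the function easy to try out locally with a stand-in object before pointing it at a real bucket.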

The whole process took 2 hours, down from a theoretical 63 hours with plain requests.

3. The lambda function downloads the data and saves it back to S3

The core of it is the Lambda function. We leverage API gateway to rotate the IP when we get the error message. The function comes from the package

We define a function to rotate the IP on the fly.

Note that we prevent Lambda from storing cookies.

We can loop through the range we defined in the JSON, which is our payload.
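The loop with IP rotation can be sketched as follows. This is not the author's Lambda code: `fetch` and `rotate_ip` are hypothetical callables supplied by the caller (for example, a function posting `start` to the endpoint, and a function that opens a new API Gateway route), so the retry logic can be read on its own.

```python
# Beginning of the server's "too many requests" message.
RATE_LIMIT_MESSAGE = "您的操作过于频繁"


def download_chunk(starts, fetch, rotate_ip, max_rotations=5):
    """Download every page of a chunk, rotating the IP when blocked.

    fetch(start) returns the response text for one payload.
    rotate_ip() switches the outgoing IP address.
    Returns a dict {start: response text}; pages that stay blocked
    after max_rotations rotations are left out so they can be retried.
    """
    pages = {}
    for start in starts:
        for _ in range(max_rotations + 1):
            text = fetch(start)
            if RATE_LIMIT_MESSAGE not in text:
                pages[start] = text
                break
            rotate_ip()  # blocked: change IP and try the same page again
    return pages
```

Comparing the keys of the returned dict with the `starts` that were requested gives the list of pages still missing, which is what the outer "loop until S3 has all the pages" needs.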

In the end, all the data is stored in S3, and we can load it into a dataframe. The dataframe should have 227257 rows.

2. Get the address and city name

The issue with the data from the official website is that the city name is not available in a separate column. Instead, it is embedded in the name of the branch, i.e. the address. In this example, “中国邮政储蓄银行股份有限公司丹凤县广场营业所”, the location is 丹凤, which is a county (县) rather than a city. For the sake of our analysis, we want to work at the city level. We have a file with the different city IDs, which we need in order to match other economic data provided by the Chinese government. As the image below shows, we want to find the “geocode4_corr” based on the “cityen” or “citycn”.

We need to proceed in two steps:

  1. Find the branches located in the city
  2. Find the city when the branch is located in a lower administrative district (county, village, etc)

Find the branches located in the city

For this part, a simple regex match will do the job: 121104 rows out of 227257 are matched.

We need to find the city for 115494 rows.
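A minimal sketch of such a regex match. The list of city names is a tiny hypothetical sample; in practice it would come from the “citycn” column of the city file.

```python
import re

# Hypothetical sample of Chinese city names.
CITY_NAMES = ["上海", "北京", "洛阳", "商洛"]

# One alternation pattern; longer names first so they win over shorter prefixes.
CITY_PATTERN = re.compile(
    "|".join(re.escape(name) for name in sorted(CITY_NAMES, key=len, reverse=True))
)


def find_city(branch_name):
    """Return the first city name found in the branch name, or None."""
    match = CITY_PATTERN.search(branch_name)
    return match.group(0) if match else None


print(find_city("中国工商银行股份有限公司上海市分行"))  # prints "上海"
```

A branch located in a county, such as the 丹凤县 example, returns `None` here and is handled in the next step.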

Find the city when the branch is located in a lower administrative district

For this part, we rely on the Google Maps API to get the address and other relevant information such as the longitude and latitude. Take this example: if we pass the address “中国邮政储蓄银行股份有限公司丹凤县广场营业所” to the API, we can find the city -> Shangluo, and we know that Shangluo is in our dataset.

Google API is not free, so the trick is to create a new account to enjoy $300 of credit and another $100 if we have a business email address.

The issue with this strategy is that we don’t get the city name directly: the location is stored as a tuple of coordinates (column location in the image below).

To tackle this problem, we use the longitude and latitude to get the city name with the Geopy library.

To speed up the job, we leverage Lambda functions. The process takes about 30 minutes.
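A hedged sketch of the reverse-geocoding step. It assumes the Nominatim geocoder from Geopy, whose `reverse` method accepts a `(latitude, longitude)` tuple and a `language` argument and returns a location with a `raw` dictionary; the order of the address keys tried below is a guess that would need checking against real responses for Chinese addresses.

```python
def city_from_address(address):
    """Pick a city-level name from a Nominatim-style address dict.

    The key order is an assumption, not a documented rule.
    """
    for key in ("city", "town", "county", "state_district"):
        if address.get(key):
            return address[key]
    return None


def reverse_city(location, reverse):
    """Return the city for a (latitude, longitude) tuple.

    `reverse` is a callable such as Nominatim(user_agent="...").reverse.
    """
    result = reverse(location, language="en")
    if result is None:
        return None
    return city_from_address(result.raw.get("address", {}))


# Usage, assuming geopy is installed and network access is available:
#   from geopy.geocoders import Nominatim
#   geolocator = Nominatim(user_agent="bank-branches-example")
#   reverse_city((33.69, 110.33), geolocator.reverse)
```

Note that Geopy expects latitude first, so the tuple should be checked for ordering before it is passed along.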

Find city ID

We have most of our branches with a city but for 24480 rows we don’t have a city name. To retrieve them, we will use fuzzy string matching.

We create the list of addresses to extract the city name. Below, we can see the first three candidates.

The possible values are stored in another list -> this is a combination of the names in English and in Chinese.

We create a function to find the best score. So if we look at our first example, “jiuxianzhen song county luoyang henan china”, the best result is “luoyang”.
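The scoring function can be sketched with the standard library. This uses `difflib.SequenceMatcher` as a stand-in for the fuzzy-matching library used in the project, and compares each word of the address with each candidate; the candidate list is a hypothetical sample.

```python
from difflib import SequenceMatcher


def best_city(address, candidates):
    """Return (candidate, score) with the highest word-level similarity.

    The score is in [0, 1]; 1.0 means one word of the address
    is identical to the candidate.
    """
    tokens = address.lower().split()
    best = (None, 0.0)
    for candidate in candidates:
        score = max(
            (SequenceMatcher(None, token, candidate.lower()).ratio() for token in tokens),
            default=0.0,
        )
        if score > best[1]:
            best = (candidate, score)
    return best


candidates = ["luoyang", "shangluo", "zhengzhou"]
result = best_city("jiuxianzhen song county luoyang henan china", candidates)
print(result)  # prints "('luoyang', 1.0)"
```

Scoring word by word avoids penalizing long addresses, where the city is only a small part of the string.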

To speed up the process, we use multithreading. It takes about 5 minutes with 4 cores.
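A minimal sketch of running the matching in parallel with a thread pool of 4 workers. The matching function is a simplified stand-in based on `difflib.get_close_matches`, and the names and the 0.85 cutoff are assumptions for illustration. For purely CPU-bound matching in Python, `ProcessPoolExecutor` from the same module may scale better than threads.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import get_close_matches

# Hypothetical sample of candidate city names.
CANDIDATES = ["luoyang", "shangluo", "zhengzhou"]


def match_city(address):
    """Return the first candidate closely matching a word of the address."""
    for token in address.lower().split():
        hit = get_close_matches(token, CANDIDATES, n=1, cutoff=0.85)
        if hit:
            return hit[0]
    return None


def match_all(addresses, workers=4):
    """Match every address; results keep the order of the input."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(match_city, addresses))


cities = match_all([
    "jiuxianzhen song county luoyang henan china",
    "danfeng county shangluo shaanxi china",
])
```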

At last, we just need to merge the city geocode data to get the city ID.

In the end, using AWS Lambda, Google Maps, and Geopy, we manage to get 175555 rows out of 227700 (77%) of the data in less than 3 hours.

3. Extract bank type (SOB, CCB, Rural, etc)

The last step is not straightforward either. We need to find the bank details, namely:

  • 1.政策性银行: policy bank
  • 2.国有控股大型商业银行: State-controlled large commercial bank
  • 3.股份制商业银行: Joint-stock commercial bank
  • 4.城市商业银行: City commercial bank
  • 5.农村商业银行: Rural commercial bank
  • 6.外资银行: foreign banks
  • 7.其他: other
  • 8.农合行: Rural cooperative bank
  • 9.农信社: Rural Credit Cooperative
  • 10.三类新型农村金融机构: Three types of new rural financial institutions

For the first three types, the type is included in the name or we already know the name of the banks:

Policy banks: China has three policy banks. Among them, China Development Bank was incorporated in December 2008 and officially defined by the State Council as a development finance institution in March 2015

  • 中国农业发展银行
  • 国家开发银行
  • 中国进出口银行

State-owned Commercial Banks: China has six state-owned commercial banks. These banks are ranked by their Tier 1 capital amount as of 2018.

  • 中国工商银行
  • 中国建设银行
  • 中国银行
  • 中国农业银行
  • 交通银行
  • 中国邮政储蓄银行

But the challenge is to find the city banks, which are our primary interest.

To make sure a branch belongs to a city bank, we use the data provided by CSMAR. CSMAR, short for China Stock Market & Accounting Research Database, is a comprehensive research-oriented database focusing on China’s finance and economy.

To find the bank type, we proceed as follows:

  1. Retrieve the bank type using the name or the type
  2. Use CSMAR and FuzzyMatching to get the city name

Retrieve the bank type using the name or the type

In this part, a simple regex matching will do the job. In fact, we retrieve 75% of the dataset using this method.
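A sketch of this regex step, using the bank names listed above for the policy and state-owned commercial banks. The two rural patterns and the English labels are assumptions added for illustration; the actual rules may differ.

```python
import re

# Order matters: the first matching pattern wins.
BANK_TYPE_PATTERNS = [
    ("policy bank", re.compile("中国农业发展银行|国家开发银行|中国进出口银行")),
    ("state-owned commercial bank", re.compile(
        "中国工商银行|中国建设银行|中国银行|中国农业银行|交通银行|中国邮政储蓄银行")),
    # Assumed patterns for rural institutions.
    ("rural commercial bank", re.compile("农村商业银行")),
    ("rural credit cooperative", re.compile("农村信用")),
]


def bank_type(name):
    """Return the bank type found in a branch name, or None if no rule applies."""
    for label, pattern in BANK_TYPE_PATTERNS:
        if pattern.search(name):
            return label
    return None


print(bank_type("中国邮政储蓄银行股份有限公司丹凤县广场营业所"))
# prints "state-owned commercial bank"
```

Names that return `None`, such as city banks whose name only contains a location, are left for the next steps.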

Most of the branches we found are SOB, rural, or policy banks, which is not a surprise.

The challenge comes with the city banks, because their names don’t include enough information. Here are some examples of banks we need to find:

One thing we are sure of is that a bank’s name must include the name of its city for it to be designated a city bank. Therefore, we will proceed in two steps:

  1. Use a regex matching to find if the province or city is included in the name
  2. Use a Transformer to detect if the name includes a location

Both steps are relatively easy. The regex step is a simple match, while the second step uses a score to exclude names that do not contain a location.

We use this transformer to detect whether a name contains a location:

It results in this dataframe, with 1774 different banks whose type we need to find in the CSMAR file.

As we saw previously, there is rarely a direct match between the address in the China Banking and Insurance Regulatory Commission dataset and the CSMAR dataset. To limit the use of rules, we leverage the Polyfuzz library to find the most likely candidates. We keep only matches with a score above 85 to avoid false positives.

Below are examples of false positives, which we remove because we don’t have enough information.

In the end, here is the final distribution of branch types.
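The thresholding idea can be illustrated with the standard library. This sketch uses `difflib.SequenceMatcher` as a stand-in for Polyfuzz and rescales the similarity to 0-100 so that the threshold of 85 reads as in the text; the scale and the sample names are assumptions.

```python
from difflib import SequenceMatcher


def match_banks(names, reference_names, min_score=85.0):
    """Map each name to (best reference name, score out of 100).

    Names whose best score is not above min_score are dropped,
    which trades recall for fewer false positives.
    """
    matches = {}
    for name in names:
        scored = [
            (SequenceMatcher(None, name, ref).ratio() * 100, ref)
            for ref in reference_names
        ]
        if not scored:
            continue
        score, ref = max(scored)
        if score > min_score:
            matches[name] = (ref, score)
    return matches


# Hypothetical reference names standing in for the CSMAR file.
reference = ["洛阳银行股份有限公司", "北京银行股份有限公司"]
found = match_banks(["洛阳银行股份有限公", "某某村镇银行"], reference)
```

Here the first name differs from a reference name by one character and is kept, while the second has no close counterpart and is dropped.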

The datasets/codes are available upon request: [email protected]

Happy coding!


  • Boyreau-Debray, G. and S.-J. Wei (2005, March). Pitfalls of a State-Dominated financial system: The case of China. Technical Report 11214, National Bureau of Economic Research
  • Ferri, G. and L.-G. Liu (2009). Honor thy creditors before an thy shareholders: Are the profits of chinese State-Owned enterprises real? Technical report, Hong Kong Institute for Monetary Research
  • Gao, H., H. Ru, R. Townsend, and X. Yang (2019, May). Rise of bank competition: Evidence from banking deregulation in china. Technical Report
  • Jarreau, J. and S. Poncet (2014). Credit constraints, firm ownership and the structure of exports in China. International Economics 139 (139), 152–173
  • Li, Y. A., W. Liao, and C. C. Zhao (2018, October). Credit constraints and firm productivity: Microeconomic evidence from china. Research in International Business and Finance 45, 134–149