Scraping Chinese or Japanese Language Text Websites

Scraping Chinese or Japanese language websites is easy in Agenty. When you create a web scraping agent with the Agenty Chrome extension, the application automatically detects and applies the character encoding and language declared in the website header. The scraper then reads and writes the text in its native language encoding.

We can also change the encoding manually by editing the agent if the default one doesn't work correctly. This usually happens when the website actually uses a different language or charset than the one declared in its header. For example, the HTML below, taken from the header of a Japanese website, declares that the character encoding used by this website is charset=shift_jis:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="ja" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=shift_jis">
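To check what a page declares before editing the agent, the charset can be read out of the raw header bytes. The following is a minimal Python sketch, not Agenty's own detection logic; the `declared_charset` helper and its regular expression are illustrative assumptions that cover only the two common `<meta>` forms.

```python
import re

# Header lines copied from the example above.
HTML_HEAD = b'''<head>
<meta http-equiv="Content-Type" content="text/html; charset=shift_jis">
'''

# Matches both <meta charset="..."> and the http-equiv/content form.
CHARSET_RE = re.compile(rb'<meta[^>]*?charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)


def declared_charset(raw_html: bytes):
    """Return the charset named in the first <meta> declaration, or None."""
    # Only the start of the page is scanned, since the declaration sits in <head>.
    match = CHARSET_RE.search(raw_html[:4096])
    return match.group(1).decode("ascii").lower() if match else None


print(declared_charset(HTML_HEAD))
```

The function works on bytes rather than text because the page cannot be decoded reliably until its encoding is known.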

Encoding

To change the encoding of the scraping agent so that it extracts this Japanese website correctly, follow these steps:

  1. Edit the scraping agent by clicking the "Edit agent" button:

    [Screenshot: editing the scraper]

  2. Go to the "Advance Options" tab and select the correct encoding in the "Character Encoding" option, as in the screenshot below:

    [Screenshot: Japanese language encoding]

  3. Save the agent and re-run it.

Execution

After the correct encoding configuration is applied to the agent, the web scraper produces clean, correctly encoded output for Japanese, Chinese or any other language. For example, in the screenshot below I ran the kakaku agent, and the text in the output table appears in the same Japanese as on the website. So whether it's Chinese text, Japanese text or any other language, as long as the correct encoding is selected in the Agenty online scraping app, the scraper will extract the text exactly as it appears on the website or in any browser such as Chrome or Firefox.

[Screenshot: Japanese or Chinese language text scraper]
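Why the selected encoding matters can be shown with a small round trip. This is a generic Python illustration, not a description of Agenty's internals: the same bytes give back the original text only when decoded with the encoding they were written in.

```python
text = "デザイン"  # a label that appears in the sample output

# Bytes as a Shift_JIS website would serve them.
raw = text.encode("shift_jis")

# Decoding with the matching encoding restores the original text.
right = raw.decode("shift_jis")

# Decoding with an unrelated single-byte encoding never fails,
# but it yields unreadable characters instead of Japanese.
wrong = raw.decode("latin-1")

print(right == text)
print(wrong == text)
```

The second decode does not raise an error, which is why a wrong encoding setting tends to show up as garbled output rather than as a failed job.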

Output

Once the scraping job is completed and the output is ready in the table, you can download it in CSV, TSV or JSON format. We recommend the TSV (tab-delimited) format for non-English websites, because that is the format in which the web data is extracted and stored on the cloud server; it is converted into the other formats on download requests. A sample of the output in JSON format:

[
  {
    "Field1": "2017年8月9日 09:42 [1052570-2]",
    "Field2": "デザイン5\n携帯性5\nボタン操作5\n文字変換5\nレスポンス5\nメニュー5\n画面表示5\n通話音質無評価\n呼出音・音楽5\nバッテリー無評価\n\n\n\n\n満足度5"
  },
  {
    "Field1": "2017年8月6日 10:40 [1051822-1]",
    "Field2": "デザイン3\n携帯性4\nボタン操作3\n文字変換3\nレスポンス3\nメニュー2\n画面表示4\n通話音質4\n呼出音・音楽3\nバッテリー3\n\n\n\n\n満足度2"
  },
  {
    "Field1": "2017年8月1日 21:49 [1050666-1]",
    "Field2": "デザイン5\n携帯性5\nボタン操作5\n文字変換3\nレスポンス4\nメニュー5\n画面表示5\n通話音質5\n呼出音・音楽5\nバッテリー5\n\n\n\n\n満足度5"
  },
  {
    "Field1": "2017年7月27日 10:49 [1045187-3]",
    "Field2": "デザイン4\n携帯性4\nボタン操作4\n文字変換2\nレスポンス3\nメニュー3\n画面表示3\n通話音質3\n呼出音・音楽3\nバッテリー5\n\n\n\n\n満足度4"
  },
  {
    "Field1": "2017年7月23日 19:08 [1048222-1]",
    "Field2": "デザイン5\n携帯性3\nボタン操作3\n文字変換5\nレスポンス3\nメニュー3\n画面表示5\n通話音質3\n呼出音・音楽3\nバッテリー5\n\n\n\n\n満足度4"
  },
  {
    "Field1": "2017年7月20日 07:25 [1047291-1]",
    "Field2": "デザイン5\n携帯性5\nボタン操作5\n文字変換5\nレスポンス5\nメニュー3\n画面表示5\n通話音質5\n呼出音・音楽5\nバッテリー5\n\n\n\n\n満足度5"
  },
  {
    "Field1": "2017年7月18日 23:31 [1010901-2]",
    "Field2": "デザイン5\n携帯性3\nボタン操作4\n文字変換5\nレスポンス5\nメニュー5\n画面表示5\n通話音質5\n呼出音・音楽5\nバッテリー5\n\n\n\n\n満足度5"
  },
  {
    "Field1": "2017年7月18日 18:04 [1046921-1]",
    "Field2": "デザイン5\n携帯性4\nボタン操作5\n文字変換3\nレスポンス5\nメニュー4\n画面表示5\n通話音質無評価\n呼出音・音楽無評価\nバッテリー5\n\n\n\n\n満足度5"
  },
  {
    "Field1": "2017年7月13日 22:30 [1045327-1]",
    "Field2": "デザイン4\n携帯性4\nボタン操作4\n文字変換無評価\nレスポンス4\nメニュー3\n画面表示4\n通話音質4\n呼出音・音楽3\nバッテリー5\n\n\n\n\n満足度4"
  },
  {
    "Field1": "2017年7月13日 19:27 [1044888-2]",
    "Field2": "デザイン3\n携帯性4\nボタン操作3\n文字変換無評価\nレスポンス3\nメニュー3\n画面表示5\n通話音質4\n呼出音・音楽5\nバッテリー4\n\n\n\n\n満足度4"
  },
  {
    "Field1": "2017年7月9日 04:05 [1043954-1]",
    "Field2": "デザイン4\n携帯性5\nボタン操作2\n文字変換2\nレスポンス3\nメニュー4\n画面表示5\n通話音質3\n呼出音・音楽5\nバッテリー3\n\n\n\n\n満足度3"
  },
  {
    "Field1": "2017年7月9日 02:23 [1043949-1]",
    "Field2": "デザイン3\n携帯性2\nボタン操作3\n文字変換3\nレスポンス2\nメニュー3\n画面表示4\n通話音質4\n呼出音・音楽3\nバッテリー4\n\n\n\n\n満足度2"
  },
  {
    "Field1": "2017年7月7日 17:50 [1043528-1]",
    "Field2": "デザイン5\n携帯性5\nボタン操作5\n文字変換無評価\nレスポンス3\nメニュー4\n画面表示5\n通話音質2\n呼出音・音楽5\nバッテリー3\n\n\n\n\n満足度4"
  },
  {
    "Field1": "2017年7月5日 11:28 [1042880-1]",
    "Field2": "デザイン5\n携帯性4\nボタン操作4\n文字変換3\nレスポンス4\nメニュー4\n画面表示4\n通話音質無評価\n呼出音・音楽4\nバッテリー4\n\n\n\n\n満足度5"
  },
  {
    "Field1": "2017年7月4日 07:31 [1042520-1]",
    "Field2": "デザイン4\n携帯性4\nボタン操作4\n文字変換4\nレスポンス5\nメニュー4\n画面表示5\n通話音質無評価\n呼出音・音楽4\nバッテリー3\n\n\n\n\n満足度5"
  }
]
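Once downloaded, the JSON output can be post-processed with any JSON parser. The sketch below, in Python, splits the multi-line Field2 value into one score per label. It assumes each non-empty line is a label followed by either a number or 無評価 (not rated); the `parse_ratings` helper is hypothetical and not part of Agenty.

```python
import json
import re

# First record copied from the sample output above.
# The raw string keeps the \n escapes intact so the text stays valid JSON.
SAMPLE = r'''[
  {
    "Field1": "2017年8月9日 09:42 [1052570-2]",
    "Field2": "デザイン5\n携帯性5\nボタン操作5\n文字変換5\nレスポンス5\nメニュー5\n画面表示5\n通話音質無評価\n呼出音・音楽5\nバッテリー無評価\n\n\n\n\n満足度5"
  }
]'''

# A label followed by a trailing score, or by 無評価 ("not rated").
RATING_RE = re.compile(r"^(.+?)(\d+|無評価)$")


def parse_ratings(field2: str) -> dict:
    """Map each label to its integer score, or None when it is not rated."""
    ratings = {}
    for line in field2.splitlines():
        match = RATING_RE.match(line.strip())
        if match:  # blank separator lines do not match and are skipped
            label, value = match.groups()
            ratings[label] = int(value) if value.isdigit() else None
    return ratings


records = json.loads(SAMPLE)
ratings = parse_ratings(records[0]["Field2"])
print(ratings["デザイン"], ratings["通話音質"], ratings["満足度"])
```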

Accept language header

Make sure the Accept-Language header in the agent's header section is set to * (asterisk), so that the HTTP reader accepts websites in any language, such as Chinese, German or French.

Alternatively, you can set a specific language code as the value. For example:

  • Chinese - Accept-Language: zh
  • English - Accept-Language: en
  • French - Accept-Language: fr
  • Spanish - Accept-Language: es 
  • German - Accept-Language: de
  • Japanese - Accept-Language: ja
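For readers who want to see what this header looks like outside the agent editor, here is a minimal sketch using Python's standard library. It only builds the request and does not send it. The `build_request` helper and the example.com URL are placeholders, and this is not how Agenty itself is configured.

```python
from urllib.request import Request


def build_request(url: str, language: str = "*") -> Request:
    """Build a request whose Accept-Language header is a language code, or * for any."""
    return Request(url, headers={"Accept-Language": language})


# Placeholder URL; asks for Japanese content.
req = build_request("https://example.com/", "ja")

# urllib stores header names capitalized as "Accept-language".
print(req.get_header("Accept-language"))
```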

How to find the correct encoding

Now the question is how to find the encoding a website actually uses, and what to do when it differs from the charset declared in the header. There can be cases where the declared charset is different from the actual encoding, or where no charset is declared at all.

For those cases, we use the W3 Validator, where you can enter the URL of the website, or paste the page source, to detect the encoding actually used for the text. For example, in the screenshot below I submitted the website URL for validation, and the W3 Validator detected the shift_jis encoding automatically.

[Screenshot: automatic language encoding detection]
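A rough local alternative is to try decoding the raw page bytes with a short list of likely encodings and keep the first one that succeeds. The Python sketch below is a simple heuristic and not what the W3 Validator does. The candidate list and the `guess_encoding` helper are assumptions, and the order matters because permissive encodings such as gb18030 accept many byte sequences.

```python
# Likely encodings for Japanese and Chinese pages, strictest first (an assumption).
CANDIDATES = ("utf-8", "shift_jis", "euc_jp", "gb18030")


def guess_encoding(raw: bytes, candidates=CANDIDATES):
    """Return the first candidate that decodes raw without error, else None."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Stand-in for bytes fetched from a Shift_JIS page.
sample = "満足度".encode("shift_jis")
print(guess_encoding(sample))
```

A successful decode only shows that the bytes are valid in that encoding, so the result should still be checked by looking at the decoded text.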

So, if you are looking to extract data from Chinese or Japanese websites, sign up with Agenty.