I want to scrape a ShiftJIS site

Apparently that request still comes up now and then. For fetching a page's meta information, at least, this is what I ended up with.

const scrapeIt = require('scrape-it');
const got = require('got')
const encoding = require('encoding-japanese')

async function main() {

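  // Replace this with the URL of a ShiftJIS site; please find one yourself.
  const url = "http://okamuuu.hatenablog.com/"
  // encoding: null makes got (v9 and earlier) return the body as a raw Buffer.
  // Newer got versions use responseType: 'buffer' instead.
  const {body} = await got(url, {encoding: null})
  // Guess the source encoding from the raw bytes (e.g. 'SJIS').
  const detected = encoding.detect(body);
  // Convert to an array of Unicode code units, then join it into a string.
  const unicode = encoding.convert(body, {
    from: detected,
    to: 'UNICODE'
  })
  const html = encoding.codeToString(unicode)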
  const result = await scrapeIt.scrapeHTML(html, {
    title: "title",
    metaInfos: {
      listItem: "meta",
      data: {
        name: { attr: "name" },
        content: { attr: "content" },
      }
    }
  })
  console.log(result)
}

main()

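Passing the {encoding: null} option to the HTTP request (got, in this case) so that the body comes back as raw binary makes the rest of the processing easy: the bytes can be detected and converted before anything tries to read them as UTF-8. That's all.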
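To see why the raw bytes matter, here is a minimal sketch that uses Node's built-in TextDecoder instead of encoding-japanese. It assumes a Node build with full ICU (the default in recent releases), since the 'shift_jis' label is otherwise unavailable. The helpers charsetFromContentType and decodeBody are hypothetical names made up for this illustration, and the byte sequence is hand-written so that no network access is needed.

```javascript
// Shift_JIS bytes for "日本語", written by hand so the sketch runs offline.
const sjisBytes = Buffer.from([0x93, 0xfa, 0x96, 0x7b, 0x8c, 0xea]);

// Hypothetical helper: pull the charset label out of a Content-Type header value.
function charsetFromContentType(contentType) {
  const match = /charset\s*=\s*"?([^";\s]+)/i.exec(contentType || '');
  return match ? match[1].toLowerCase() : null;
}

// Hypothetical helper: decode a raw response body, falling back to UTF-8
// when the header does not name a charset.
function decodeBody(buffer, contentType) {
  const charset = charsetFromContentType(contentType) || 'utf-8';
  return new TextDecoder(charset).decode(buffer);
}

// Reading the bytes as UTF-8 yields replacement characters (mojibake).
const garbled = decodeBody(sjisBytes, 'text/html');

// Reading the same bytes as Shift_JIS recovers the original text.
const decoded = decodeBody(sjisBytes, 'text/html; charset=Shift_JIS');

console.log(garbled.includes('\uFFFD'));
console.log(decoded);
```

This relies on the server declaring its charset in the Content-Type header, which many older sites do not do; detecting the encoding from the bytes, as encoding.detect does above, covers that case.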

あいつの日誌β

Traveling while working.