NtKinect

NtKinect: Kinect V2 C++ Programming with OpenCV on Windows10

Recognizing speech captured with Kinect V2 using the Cloud Speech API of Google Cloud Platform

2017.12.20: created by

2017.12.23: revised by


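Prerequisite knowledge

NtKinect: Capturing audio with Kinect V2

Preparing to use Google's Cloud Speech API

The Cloud Speech API is the speech recognition API of Google Cloud Platform.

First, sign up for the "free trial" of the Cloud Speech API on Google Cloud Platform and configure it so that you can use it.

Following "Quickstart: learn in 5 minutes" in the "Google Cloud Speech API documentation", send the audio described in sync-request.json with curl and confirm that the recognized text is returned.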

sync-request.json

{
  "config": {
      "encoding":"FLAC",
      "sampleRateHertz": 16000,
      "languageCode": "en-US",
      "enableWordTimeOffsets": false
  },
  "audio": {
      "uri":"gs://cloud-samples-tests/speech/brooklyn.flac"
  }
}

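Sending the request with curl

Please refer to the handout distributed in class.

Obtain an access token and use it to call the Google Speech API over HTTP. As in the example above, the token can be saved to a file by redirection:

gcloud auth application-default print-access-token  > token-file.txt

When the access token is saved to a file by redirection in this way, note that the file is encoded in little-endian UTF-16.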
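The encoding issue can be illustrated with a minimal sketch in standard C++. It assumes that the token itself is pure ASCII and that a UTF-16 file starts with the little-endian byte-order mark (FF FE). The function name decodeTokenBytes is hypothetical and is not part of NtGoogleSpeech.h.

```cpp
#include <cstdint>
#include <string>

// Returns the token as plain ASCII text.
// If the bytes start with a UTF-16 LE byte-order mark (FF FE), every
// 16-bit unit below 0x80 is copied as one byte; any other unit becomes '?'.
// That shortcut is acceptable only under the assumption that an access
// token is pure ASCII.
std::string decodeTokenBytes(const std::string& raw) {
  const bool utf16le = raw.size() >= 2 &&
                       static_cast<unsigned char>(raw[0]) == 0xFF &&
                       static_cast<unsigned char>(raw[1]) == 0xFE;
  std::string out;
  if (!utf16le) {
    out = raw;  // already a single-byte encoding
  } else {
    for (std::size_t i = 2; i + 1 < raw.size(); i += 2) {
      const std::uint16_t unit = static_cast<std::uint16_t>(
          static_cast<unsigned char>(raw[i]) |
          (static_cast<unsigned char>(raw[i + 1]) << 8));
      out.push_back(unit < 0x80 ? static_cast<char>(unit) : '?');
    }
  }
  // Strip the trailing CR/LF left by the redirection.
  while (!out.empty() && (out.back() == '\r' || out.back() == '\n')) {
    out.pop_back();
  }
  return out;
}
```

Read the file in binary mode and pass its whole contents to this function; a file that is already ASCII passes through unchanged apart from the trailing newline.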

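To keep the access token from leaking to others, it is recommended to store the file containing the token outside the Visual Studio project.

[Note]
Obtaining the access token as described above sometimes fails with an error such as "Default Credential Authentification ...". This happens when the Google Application Default Credentials have not been set up correctly.

The easiest way to get rid of this error is to create the "default credentials" file by running the following command. The command launches a web browser. This is very slow with Microsoft Edge, so I recommend changing the browser under "Default apps" to Google Chrome before running it.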

gcloud auth application-default login

The "default credentials" file is created at (user's home folder)/AppData/Roaming/gcloud/application_default_credentials.json. In my case it is C:/Users/nitta/AppData/Roaming/gcloud/application_default_credentials.json.

This step is not covered in Google's "Quickstart" (as of 2017/12/25), so many people seem to stumble here.

[Note 2]
Besides creating the "default credentials" file described in [Note], there is another way to deal with the "Default Credential Authentification ..." error that can occur when obtaining the access token: set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of the downloaded service-account JSON file (ynitta-XXXXX-XXXXXXXXX.json in the example above).

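What you need to know about the audio data

Audio captured with Kinect V2 is saved as a wave file in the following format.

Attribute	Value
Format	WAVE_FORMAT_IEEE_FLOAT
Channels	1
Samples per second	16000
Bits per sample	32

The only lossless audio formats recommended for Google Speech are FLAC and LINEAR16 (as of December 20, 2017), so the WAVE_FORMAT_IEEE_FLOAT audio data must be converted to one of these formats.

LINEAR16 means WAVE_FORMAT_PCM data in which each sample is a 16-bit signed integer, so converting from WAVE_FORMAT_IEEE_FLOAT is easy. Each WAVE_FORMAT_IEEE_FLOAT sample is a floating-point value between -1.0 and 1.0, so it suffices to read every 4 bytes as a 32-bit float, multiply it by 0x7fff = 32767, and cast the result to INT16. For example, the sample 0.5 becomes 16383 after the cast truncates 16383.5.

Converting the audio data (32bit float WAVE --> 16bit int WAVE)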

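  FLOAT *p = (FLOAT *) src;  // src: pointer to the 32-bit float samples
  INT16 *q = (INT16 *) dst;  // dst: pointer to the 16-bit output buffer
  for (int i=0; i<size/4; i++) {  // size: length of the float data in bytes
    *q++ = (INT16) (32767 * (*p++));
  }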
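The conversion can also be written as a self-contained sketch in standard C++. It differs from the loop above in one respect, which is an assumption of this sketch rather than something the original code does: samples outside the range -1.0 to 1.0 are clamped first, so that the cast to a 16-bit integer cannot overflow. The function name floatToLinear16 is hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Converts 32-bit float samples in [-1.0, 1.0] to 16-bit signed PCM.
// Out-of-range samples are clamped before scaling by 32767.
std::vector<std::int16_t> floatToLinear16(const std::vector<float>& samples) {
  std::vector<std::int16_t> out;
  out.reserve(samples.size());
  for (float s : samples) {
    const float clamped = std::max(-1.0f, std::min(1.0f, s));
    // The cast truncates toward zero, as in the loop above.
    out.push_back(static_cast<std::int16_t>(clamped * 32767.0f));
  }
  return out;
}
```

Working on typed vectors instead of raw byte pointers removes the size/4 and size/2 bookkeeping, at the cost of one extra copy of the audio data.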

Note that a WAVE file saved by KinectV2_audio begins with 46 bytes of file-format information, and the audio data starts right after it.
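As an alternative to hard-coding the header size, the start of the audio data can be located by walking the RIFF chunks until the "data" chunk is found. The following is a minimal sketch that assumes the standard RIFF/WAVE layout ("RIFF", a size field, "WAVE", then a sequence of chunks); the function names are hypothetical. For a file whose "fmt " chunk holds 18 bytes it returns 46, which matches the header size stated above.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Reads a little-endian 32-bit value starting at p.
static std::uint32_t readLe32(const unsigned char* p) {
  return static_cast<std::uint32_t>(p[0]) |
         (static_cast<std::uint32_t>(p[1]) << 8) |
         (static_cast<std::uint32_t>(p[2]) << 16) |
         (static_cast<std::uint32_t>(p[3]) << 24);
}

// Returns the byte offset of the first audio sample, or -1 if the buffer
// is not a RIFF/WAVE file or contains no "data" chunk.
long findWaveDataOffset(const std::vector<unsigned char>& wav) {
  if (wav.size() < 12 || std::memcmp(&wav[0], "RIFF", 4) != 0 ||
      std::memcmp(&wav[8], "WAVE", 4) != 0) {
    return -1;
  }
  std::size_t pos = 12;  // first chunk after "RIFF" + size + "WAVE"
  while (pos + 8 <= wav.size()) {
    const std::uint32_t chunkSize = readLe32(&wav[pos + 4]);
    if (std::memcmp(&wav[pos], "data", 4) == 0) {
      return static_cast<long>(pos + 8);  // skip the chunk id and size
    }
    // Skip this chunk; chunks are padded to an even number of bytes.
    pos += 8 + static_cast<std::size_t>(chunkSize) + (chunkSize & 1);
  }
  return -1;
}
```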

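About WinHttp

The C++ REST SDK (code name "Casablanca") can no longer be installed through NuGet in Visual Studio 2017, so this example uses WinHttp to access the web server.

The JSON data that the Google Speech API returns as the recognition result has to be parsed separately. To keep the explanation simple, this example does not parse the JSON data.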
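For quick experiments, the recognized text can be pulled out of the response without a JSON library. The following is only a sketch and not a general JSON parser: it assumes the response contains a "transcript" field whose value is a string, as in the sample results on this page, and it returns the first one it finds. The function name firstTranscript is hypothetical. For real use a proper JSON library would be the safer choice.

```cpp
#include <string>

// Returns the value of the first "transcript" string field in json, or an
// empty string if none is found. A backslash escape is resolved by copying
// the next character verbatim, which is right for \" and \\ but not for
// escapes such as \n or \uXXXX.
std::string firstTranscript(const std::string& json) {
  const std::string key = "\"transcript\"";
  std::size_t pos = json.find(key);
  if (pos == std::string::npos) return "";
  pos = json.find(':', pos + key.size());
  if (pos == std::string::npos) return "";
  pos = json.find('"', pos);  // opening quote of the value
  if (pos == std::string::npos) return "";
  std::string out;
  for (std::size_t i = pos + 1; i < json.size(); ++i) {
    if (json[i] == '\\' && i + 1 < json.size()) {
      out.push_back(json[++i]);
      continue;
    }
    if (json[i] == '"') return out;  // closing quote
    out.push_back(json[i]);
  }
  return "";  // unterminated string
}
```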

Steps to create the program

Start from KinectV2_audio.zip, the Visual Studio project of "NtKinect: Capturing audio with Kinect V2".
Change the project settings so that the WinHttp.lib library is linked.

In Solution Explorer, right-click the project name "KinectV2" and select "Properties".

Make sure that the configuration is "Release" (or "Active (Release)" or "All Configurations") and that the platform is "x64" (or "Active (x64)" or "All Platforms"), then add "Winhttp.lib" under "Configuration Properties" -> "Linker" -> "Input" -> "Additional Dependencies".

Add NtGoogleSpeech.h to the project.

Put NtGoogleSpeech.h in the folder that contains main.cpp, then add it to the project: in Solution Explorer, right-click "Header Files", choose "Add" -> "Existing Item" from the menu, and specify "NtGoogleSpeech.h" as the name.

NtGoogleSpeech.h

/*
 * Copyright (c) 2017 Yoshihisa Nitta
 * Released under the MIT license
 * http://opensource.org/licenses/mit-license.php
 */

/* version 0.32: 2017/12/23 */

#pragma once

#include <fstream>
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

#include <Windows.h>
#include <Winhttp.h>

using namespace std;

class NtGoogleSpeech {
public:
  wstring host = L"speech.googleapis.com";
  wstring hpath= L"/v1/speech:recognize";
  int WaveHeaderSize = 46; /* header size of ".wav" file */
  string tokenPath = "";
  string accessToken = "";
  
  string getAccessToken() { return getAccessToken(tokenPath); }
  string getAccessToken(string path) {
    ifstream ifs(path);
    if (ifs.fail()) {
      stringstream ss;
      ss << "can not open Google Speech token file: " << path << endl;
      throw std::runtime_error( ss.str().c_str() );
    }
    string token;
    ifs >> token;
    if (token.length() > 2 && isUtf16(&token[0])) {
      vector<char> v;
      utf16ToUtf8(&token[0], (int) token.length(), v);
      string s(v.begin(), v.end());
      token = s;
    }
    return token;
  }
  
  void float32ToInt16(const void* data, int size, vector<char>& v) {
    FLOAT *p = (FLOAT *) data;  // 4byte for each data
    v.resize(size/2);
    INT16 *q = (INT16 *) &v[0];    // 2byte for each data
    for (int i=0; i<size/4; i++) {
      *q++ = (INT16) (32767 * (*p++));
    }
    return;
  }

  bool readAll(string path, vector<char>& v) {
    v.resize(0);
    ifstream ifs(path, std::ios::binary);
    if (!ifs) {
      cerr << "can not open " << path << endl;
      return false;
    }
    ifs.seekg(0,std::ios::end);
    size_t size = ifs.tellg();
    v.resize(size);
    ifs.seekg(0,std::ios::beg);
    ifs.read(&v[0],size);
    return true;
  }
  string syncRequest(char* buf, int size,string locale) {
    stringstream ss;
    ss << "{" << endl;
    ss << "  \"config\": {" << endl;
    ss << "    \"encoding\":\"LINEAR16\"," << endl;
    ss << "    \"sampleRateHertz\": 16000," << endl;
    ss << "    \"languageCode\": \"" << locale << "\"," << endl;
    ss << "    \"enableWordTimeOffsets\": false" << endl;
    ss << "  }," << endl;
    ss << "  \"audio\": {" << endl;
    ss << "    \"content\": \"";
    vector<char> v;
    float32ToInt16(buf,size,v);
    vector<char> u;
    base64Encode((char *) &v[0], (int)v.size(),u);
    string str(u.begin(), u.end());
    ss << str;
    ss << "\"" << endl;
    ss << "  }" << endl;
    ss << "}" << endl;
    return ss.str();
  }
  bool utf8ToUtf16(const char *buf, int size, vector<wchar_t>& v) {
    unsigned char *p = (unsigned char *) buf;
    for (int i=0; i<size; i++) {
      UINT16 s0 = ((UINT16) *p++) & 0xff;
      if ((s0 & 0x80) == 0) { // 1 byte
	v.push_back((wchar_t)s0);
      } else {
	if (++i >= size) return false;
	UINT16 s1 = ((UINT16) *p++) & 0xff;
	if ((s1 & 0xc0) != 0x80) return false;
	if ((s0 & 0xe0) == 0xc0) { // 2 byte
	  s0 = ((s0 & 0x1f) << 6) | (s1 & 0x3f);
	  v.push_back((wchar_t)s0);
	} else {
	  if (++i >= size) return false;
	  UINT16 s2 = ((UINT16) *p++) & 0xff;
	  if ((s2 & 0xc0) != 0x80) return false;
	  if ((s0 & 0xf0) == 0xe0) { // 3 byte
	    s0 = ((s0 & 0x0f) << 12) | ((s1 & 0x3f) << 6) | (s2 & 0x3f);
	    v.push_back((wchar_t)s0);
	  } else {
	    if (++i >= size) return false;
	    UINT16 s3 = ((UINT16) *p++) & 0xff;
	    if ((s3 & 0xc0) != 0x80) return false;
	    if ((s0 & 0xf8) == 0xf0) { // 4 byte
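              // Code points above U+FFFF need 32 bits; subtract 0x10000 before splitting into surrogates.
              UINT32 cp = ((((UINT32)s0 & 0x07) << 18) | (((UINT32)s1 & 0x3f) << 12) | (((UINT32)s2 & 0x3f) << 6) | ((UINT32)s3 & 0x3f)) - 0x10000;
              v.push_back((wchar_t) (0xd800 | ((cp >> 10) & 0x03ff)));
              v.push_back((wchar_t) (0xdc00 | (cp & 0x03ff)));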
	    } else return false;
	  }
	}
      }
    }
    return true;
  }
  bool isUtf16(const char *buf) {
    unsigned char *p = (unsigned char *)buf;
    UINT16 s0 = ((UINT16) *p++) & 0xff;
    UINT16 s1 = ((UINT16) *p++) & 0xff;
    return (s0 == 0xff && s1 == 0xfe) || (s0 == 0xfe && s1 == 0xff);
  }
  bool utf16ToUtf8(const char *buf, int size, vector<char>& v) {
    v.resize(0);
    unsigned char *p = (unsigned char *)buf;
    UINT16 s0 = ((UINT16) *p++) & 0xff;
    UINT16 s1 = ((UINT16) *p++) & 0xff;
    bool le = false;
    if (s0 == 0xff && s1 == 0xfe) le = true;  // Little Endian
    else if (s0 == 0xfe && s1 == 0xff) le = false; //  Big Endian
    else { cerr << "not utf16" << endl; return false; }
    return utf16ToUtf8(buf+2, size-2, v, le);
  }
  bool utf16ToUtf8(const char *buf, int size, vector<char>& v, bool le) {
    if (size %2 == 1) { cerr << "size is odd" << endl; return false; }
    unsigned char *p = (unsigned char *)buf;
    for (int i=0; i<size/2; i++) {
      UINT16 x;
      UINT16 s0 = ((UINT16) *p++) & 0xff;
      UINT16 s1 = ((UINT16) *p++) & 0xff;
      x = (le == true) ? (s1 << 8) + s0 : (s0 << 8) + s1;
      if (x < 0x80) v.push_back((char) (x & 0x7f));
      else if ((x >> 11) == 0) {
	v.push_back((char)(((x >> 6) & 0x1f) | 0xc0)); // 110..... (10-6)th
	v.push_back((char)((x & 0x3f) | 0x80));        // 10...... (5-0)th
      } else if ((x & 0xf800) != 0xd800) { // Basic Multi-lingual Plane
	v.push_back((char)(((x >> 12) & 0xf) | 0xe0)); // 1110.... (15-12)th
	v.push_back((char)(((x >> 6) & 0x3f) | 0x80)); // 10...... (11-6)th
	v.push_back((char)((x & 0x3f) | 0x80));        // 10...... (5-0)th
      } else { // surrogate pair
	if (++i >= size/2) {
	  cerr << "bad surrogate pair" << endl;
	  return false;
	}
	s0 = ((UINT16) *p++) & 0xff;
	s1 = ((UINT16) *p++) & 0xff;
	UINT16 y = (le == true) ? (s1 << 8) + s0 : (s0 << 8) + s1;
	if ((x & 0xfc00) != 0xd800 || (y & 0xfc00) != 0xdc00) {
	  cerr << "bad surrogate pair" << endl;
	  return false;
	}
	UINT32 z = ((x & 0x3ff) << 10) + (y & 0x3ff) + (1 << 16); // 21 bits
	v.push_back((char)(((z >> 18) & 0x7) | 0xf0));  // 11110... (20-18)th
	v.push_back((char)(((z >> 12) & 0x3f) | 0x80)); // 10...... (17-12)th
	v.push_back((char)(((z >> 6) & 0x3f) | 0x80));  // 10...... (11-6)th
	v.push_back((char)((z & 0x3f) | 0x80));         // 10...... (5-0)th
      }
    }
    return true;
  }
  bool toUtf8(char *buf, int size, vector<char>& v) {
    if (size >= 2 && isUtf16(buf)) {
      utf16ToUtf8(buf,size,v);
      return true;
    } else {
      v.resize(size);
      for (int i=0; i<size; i++) v[i] = buf[i];
      return false;
    }
  }
  void base64Encode(char* buf, int size, vector<char>& v) {
    const char table[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    v.resize((size/3)*4 + ((size % 3 == 0)? 0 : 4));
    for (int i=0; i<size/3; i++) {
      // Cast through unsigned char so that bytes >= 0x80 are not sign-extended.
      UINT32 x = ((UINT32)(unsigned char)buf[i*3] << 16) + ((UINT32)(unsigned char)buf[i*3+1] << 8) + ((UINT32)(unsigned char)buf[i*3+2]);
      v[i*4] = table[(x >> 18) & 0x3f];
      v[i*4+1] = table[(x >> 12) & 0x3f];
      v[i*4+2] = table[(x >> 6) & 0x3f];
      v[i*4+3] = table[x & 0x3f];
    }
    if (size % 3 == 0) return;
    int i = size/3;
    UINT32 x = ((UINT32)(unsigned char)buf[i*3] << 16) + ((size % 3 == 1) ? 0 : ((UINT32)(unsigned char)buf[i*3+1] << 8));
    v[i*4] = table[(x >> 18) & 0x3f];
    v[i*4+1] = table[(x >> 12) & 0x3f];
    v[i*4+2] = (size % 3 == 1) ? '=' : table[(x >> 6) & 0x3f];
    v[i*4+3] = '=';
  }
  string doSyncRequest(string path, string locale="ja-JP") {
    if (accessToken == "") {
      stringstream ss;
      ss << "no access token" << endl;
      throw std::runtime_error( ss.str().c_str() );
    }
    vector<char> audiodata;
    bool flag = readAll(path, audiodata);
    if (flag != true) {
      cerr << "can not open audio file: " << path << endl;
      return "";
    }
    string json = syncRequest((char *) &audiodata[WaveHeaderSize], (int)audiodata.size()-WaveHeaderSize,locale);
    stringstream headers;
    headers << "Content-Type: application/json" << endl;
    headers << "Authorization: Bearer " + accessToken << endl;
    headers << "Content-Length: " << json.size() << endl;

    stringstream res;

    HINTERNET session = WinHttpOpen(L"Google Speech API/1.0",
				    WINHTTP_ACCESS_TYPE_DEFAULT_PROXY,
				    WINHTTP_NO_PROXY_NAME,
				    WINHTTP_NO_PROXY_BYPASS, 0);
    if (session == 0) {
      cerr << "WinHttpOpen failed" << endl;
      return res.str();
    }
    HINTERNET conn = WinHttpConnect(session, host.c_str(), INTERNET_DEFAULT_HTTPS_PORT, 0);
    if (conn == 0) {
      cerr << "WinHttpConnect failed" << endl;
      return "";
    }
    HINTERNET req = WinHttpOpenRequest(conn, L"POST",
				       hpath.c_str(), 
				       NULL, WINHTTP_NO_REFERER,
				       WINHTTP_DEFAULT_ACCEPT_TYPES,
				       WINHTTP_FLAG_SECURE);
    if (req == 0) {
      cerr << "WinHttpOpenRequest failed" << endl;
      return "";
    }
    string h = headers.str();
    wstring wh(h.begin(),h.end());
    BOOL result = WinHttpSendRequest(req, wh.c_str(), (DWORD)h.length(),
                                     (LPVOID)json.c_str(), (DWORD)json.length(),
                                     (DWORD)json.length(), 0);  // total length counts only the request body
    if (result == 0) {
      cerr << "WinHttpSendRequest failed" << endl;
      return res.str();
    }
    result = WinHttpReceiveResponse(req, NULL);
    if (result == FALSE) {
      cerr << "WinHttpReceiveResponse failed " << GetLastError() << endl;
      return res.str();
    }

    DWORD dwSize = sizeof(DWORD);
    DWORD dwStatusCode = 0;
    result = WinHttpQueryHeaders(req,WINHTTP_QUERY_STATUS_CODE | WINHTTP_QUERY_FLAG_NUMBER,
                               WINHTTP_HEADER_NAME_BY_INDEX,
                               &dwStatusCode, &dwSize, WINHTTP_NO_HEADER_INDEX);
    if (result == FALSE) {
      cerr << "WinHttpQueryHeaders failed " << GetLastError() << endl;
      return res.str();
    }
    res << dwStatusCode << endl;  // the status code is valid only after a successful query

    for (;;) {
      DWORD dwSize = 0;
      DWORD dwDL = 0;
      result = WinHttpQueryDataAvailable(req, &dwSize);
      if (result == FALSE) {
        cerr << "WinHttpQueryDataAvailable failed " << GetLastError() << endl;
        break;  // stop reading; dwSize is not meaningful after a failure
      }
      if (dwSize == 0) break;
      vector<char> buf(dwSize+1);
      result = WinHttpReadData(req,(LPVOID) &buf[0], dwSize, &dwDL);
      if (result == FALSE) {
        cerr << "WinHttpReadData failed " << GetLastError() << endl;
	return res.str();
      }
      
      string s(buf.begin(), buf.begin()+ dwDL);
      res << s << endl;
    }
    
    WinHttpCloseHandle(req);
    WinHttpCloseHandle(conn);
    WinHttpCloseHandle(session);

    return res.str();
  }
  void initialize(string p) {
    tokenPath = p;
    accessToken = getAccessToken();
  }

public:
  NtGoogleSpeech() { initialize("C:\\Program Files (x86)\\Google\\Cloud SDK\\token-file.txt"); }
  NtGoogleSpeech(string path) { initialize(path); }
  ~NtGoogleSpeech() { }
};
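NtGoogleSpeech.h sends the converted audio as Base64 text in the "content" field of the request. When checking an encoder against known values, a standalone version is convenient. The following is a minimal sketch in standard C++, not the author's implementation: the function name toBase64 is hypothetical, and it takes unsigned bytes so that values of 0x80 and above are never sign-extended.

```cpp
#include <string>

// Encodes size bytes starting at data as standard Base64 with '=' padding.
std::string toBase64(const unsigned char* data, std::size_t size) {
  static const char table[] =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  std::string out;
  out.reserve((size + 2) / 3 * 4);
  for (std::size_t i = 0; i < size; i += 3) {
    const std::size_t n = (size - i < 3) ? size - i : 3;  // bytes in this group
    unsigned long x = static_cast<unsigned long>(data[i]) << 16;
    if (n > 1) x |= static_cast<unsigned long>(data[i + 1]) << 8;
    if (n > 2) x |= data[i + 2];
    out.push_back(table[(x >> 18) & 0x3f]);
    out.push_back(table[(x >> 12) & 0x3f]);
    out.push_back(n > 1 ? table[(x >> 6) & 0x3f] : '=');
    out.push_back(n > 2 ? table[x & 0x3f] : '=');
  }
  return out;
}
```

The three bytes of "Man" encode to "TWFu", the two bytes of "Ma" to "TWE=", and the single byte "M" to "TQ==", which covers all three padding cases.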

Change the contents of main.cpp as follows.

Adjust main.cpp to match your environment.

  NtGoogleSpeech gs("C:\\Users\\nitta\\Documents\\GoogleSpeech\\token-file.txt");

You need to pass the path of the access token file to the NtGoogleSpeech constructor.

main.cpp

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

#define USE_AUDIO
#include "NtKinect.h"

#include "NtGoogleSpeech.h"

using namespace std;

#include <time.h>
string now() {
  char s[1024];
  time_t t = time(NULL);
  struct tm lnow;
  localtime_s(&lnow, &t);
  sprintf_s(s, "%04d-%02d-%02d_%02d-%02d-%02d", lnow.tm_year + 1900, lnow.tm_mon + 1, lnow.tm_mday,
	    lnow.tm_hour, lnow.tm_min, lnow.tm_sec);
  return string(s);
}

void doJob() {
  NtKinect kinect;
  bool flag = false;
  string filename = "";
  NtGoogleSpeech gs("C:\\Users\\nitta\\Documents\\GoogleSpeech\\token-file.txt");

  std::wcout.imbue(std::locale("")); // for wcout
  while (1) {
    kinect.setRGB();
    if (flag) kinect.setAudio();
    cv::putText(kinect.rgbImage, flag ? "Recording" : "Stopped", cv::Point(50, 50),
		cv::FONT_HERSHEY_SIMPLEX, 1.2, cv::Scalar(0, 0, 255), 1, CV_AA);
    cv::imshow("rgb", kinect.rgbImage);
    auto key = cv::waitKey(1);
    if (key == 'q') break;
    else if (key == 'r') flag = true;
    else if (key == 's') flag = false;
    else if (key == 'u' || key == 'j') {
      if (filename != "") {
	string res = gs.doSyncRequest(filename,(key == 'u')? "en-US" : "ja-JP");
	vector<wchar_t> u16;
	if (gs.utf8ToUtf16(&res[0],(int)res.length(),u16)) {
	  wstring w16(u16.begin(),u16.end());
	  std::wcout << w16 << endl;
	} else {
	  cout << res << endl;
	}
	
	string outname(filename.begin(), filename.end()-4);
	ofstream fout(outname+".txt");
	fout << res;
      }
    }

    if (flag && !kinect.isOpenedAudio()) {
      filename = now() + ".wav";
      kinect.openAudio(filename);
    } else if (!flag && kinect.isOpenedAudio()) kinect.closeAudio();
  }
  cv::destroyAllWindows();
}

int main(int argc, char** argv) {
  try {
    doJob();
  }
  catch (exception &ex) {
    cout << ex.what() << endl;
    string s;
    cin >> s;
  }
  return 0;
}

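The 'r' key starts recording and the 's' key stops it. The program takes the time at which recording starts and uses it as the name of the wav file it creates (for example "2016-07-18_09-16-32.wav"). The 'j' or 'u' key sends the most recently recorded audio to the Google Speech API and saves the result in a file with the .txt extension. With 'u' the audio is recognized as English ("en-US"), with 'j' as Japanese ("ja-JP").

The recognition result is returned as a UTF-8 string. If a Japanese result is displayed as it is, the characters may appear garbled depending on the environment.

When you run the program, the RGB image is displayed. Press the 'q' key to quit.

Press 'r' to start recording and 's' to stop. The recording state is shown as "Recording" or "Stopped" at the top left of the RGB image.

Press 'j' or 'u' to recognize the most recent recording as Japanese or English.

The number shown before the JSON result (200, 401, and so on) is the HTTP status code.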
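Since the status code is written on the first line of the text and the JSON follows it, the two can be separated again before further processing. The following is a minimal sketch in standard C++; the type SpeechResponse and the function splitResponse are hypothetical helpers, not part of NtGoogleSpeech.h.

```cpp
#include <exception>
#include <string>

// Hypothetical holder for the two parts of the returned text.
struct SpeechResponse {
  int status = 0;    // HTTP status code, 0 if the first line is not a number
  std::string body;  // everything after the first line (the JSON text)
};

// Splits "<status>\n<json...>" into its status code and body.
SpeechResponse splitResponse(const std::string& text) {
  SpeechResponse r;
  const std::size_t eol = text.find('\n');
  const std::string firstLine = text.substr(0, eol);
  try {
    r.status = std::stoi(firstLine);
  } catch (const std::exception&) {
    return r;  // no leading status code; leave status at 0
  }
  if (eol != std::string::npos) r.body = text.substr(eol + 1);
  return r;
}
```

A caller can then branch on the status, for example treating 200 as success and 401 as a sign that the access token needs to be refreshed.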

2017-12-20_18-57-05.wav

2017-12-20_18-57-05.txt

200
{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "good morning",
          "confidence": 0.9117151
        }
      ]
    }
  ]
}

2017-12-20_18-57-21.wav

2017-12-20_18-57-21.txt

200
{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "おはよう",
          "confidence": 1
        }
      ]
    }
  ]
}

Note that access tokens for the Google Speech API expire very quickly. If you start getting "UNAUTHENTICATED" responses, replace the contents of token-file.txt with a new access token.

When the access token has expired

401
{
  "error": {
    "code": 401,
    "message": "Request had invalid authentication credentials. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.",
    "status": "UNAUTHENTICATED"
  }
}

The sample project is available here: KinectV2_GoogleSpeech.zip.

The zip file above does not necessarily contain the latest NtKinect.h, so please download the latest version from here and replace it.

http://nw.tsuda.ac.jp/