AVSubtitles - Forum

Machine captioning SRT testing

2021-07-03 04:26:14

Hi, I am currently working on a personal project to automate the captioning of Japanese AV files, from Japanese to English. I am a native English speaker, so deciphering Japanese will be an issue for me. Is there anyone here, preferably with some knowledge of Japanese, who wishes to collaborate? Essentially the process will be:
1. Identify a Japanese AV title.
2. I caption it and provide 2 files: Jap.srt and Eng.srt.
3. You proofread it.

Re: Machine captioning SRT testing

javabeanies

2021-07-03 04:27:28

Sample English SRT

1
00:00:01,66 --> 00:00:41,75
Hi hi yeah yeah yeah yeah it's hot yeah yeah on the way home yeah

2
00:00:42,13 --> 00:00:51,95
Also, I'm still in Japan after eating something like that, so I went home during the summer vacation.

3
00:00:51,95 --> 00:00:57,24
I think I've just been around at home, but I'm glad

4
00:00:57,52 --> 00:01:04,62
You're at home Yeah Well, we're having a festival meeting from now on

5
00:01:04,62 --> 00:01:11,77
I'm gonna go to make rice for my grandfather on the back

6
00:01:11,77 --> 00:01:18,15
It's not free anyway, that old man is also old, and sometimes he has a wait-and-see

Re: Re: Machine captioning SRT testing

javabeanies

2021-07-03 04:27:56

Sample Japanese SRT

1
00:00:01,66 --> 00:00:41,75
ひひふんええうんございますうんはあ暑いわね帰りにうんああ

2
00:00:42,13 --> 00:00:51,95
またそんななんか食べてまだ日本はある私はだからね全く夏休みで帰省

3
00:00:51,95 --> 00:00:57,24
したと思えば家でごろごろしてばかりなんだからけど良かった

4
00:00:57,52 --> 00:01:04,62
あんたが家にいてくれてうんまあこれから祭の打ち合わせがあってあんた

5
00:01:04,62 --> 00:01:11,77
裏のお爺ちゃんのご飯作りに行ってきなよなんで私がいい

6
00:01:11,77 --> 00:01:18,15
じゃないどうせヒマでしょあのお爺ちゃんも年だしえきたまに様子見が

Re: Re: Re: Machine captioning SRT testing

swierszczyk69

2021-07-03 23:55:08

The idea may be cool, but I doubt anyone here knows Japanese

Re: Re: Re: Re: Machine captioning SRT testing

javabeanies

2021-07-04 01:19:46

Yeah. Just giving it a shot. Are there any popular source languages besides Japanese? I can have a go at translating those into English.

Re: Re: Re: Re: Re: Machine captioning SRT testing

truc1979

2021-07-04 11:56:23

Hello, I learnt Japanese for 2 years at school (as well as German and Spanish), but I have forgotten most of it. I'm totally unable to follow a dialog; however, I can still catch some words. For JAV, that's often enough to notice that Google speech recognition still can't handle Japanese.

Your project is very interesting. You should test it by auto-captioning a movie which already has subtitles and comparing the results.

I think I did something similar: ffmpeg + Google API. I tried German, Italian and Japanese. The latter was by far the worst result. The German and Italian results weren't so bad, enough to understand the main topic, but they needed a lot of corrections to make the dialogs fully understandable. Actually, your project may be the same thing some of us (non-native English speakers) did to caption English movies.

As I said before, you should try it on an already subtitled movie, or on a short movie sample first (if it's a very short sample, I can help sometimes, depending on my free time).

Otherwise, in adult movies, besides English and Japanese, German and Italian are very popular languages.
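On the idea of comparing against a movie that already has subtitles, here is a minimal sketch of how that check could be scripted. It is not what either of us actually ran: the function names `srt_text` and `similarity` are made up for illustration, and it only measures word overlap using Python's standard `difflib`, ignoring timing completely.

```python
import difflib
import re

# Matches a timing line such as "00:00:01,66 --> 00:00:41,75".
# The fractional part is allowed to have any number of digits.
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2}[,.]\d+\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d+")


def srt_text(srt: str) -> str:
    """Return only the dialogue of an SRT string, lowercased.

    Blank lines, cue indexes and timing lines are dropped.
    Caveat: a dialogue line made only of digits is dropped too.
    """
    kept = []
    for line in srt.splitlines():
        line = line.strip()
        if not line or line.isdigit() or TIMING.match(line):
            continue
        kept.append(line.lower())
    return " ".join(kept)


def similarity(machine_srt: str, reference_srt: str) -> float:
    """Word-level similarity between two SRT strings, from 0.0 to 1.0."""
    a = srt_text(machine_srt).split()
    b = srt_text(reference_srt).split()
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


if __name__ == "__main__":
    # Hypothetical inputs: in practice these would be read from the
    # machine-generated file and the existing human-made subtitle file.
    machine = "1\n00:00:01,66 --> 00:00:41,75\nYou're at home yeah\n"
    reference = "1\n00:00:01,000 --> 00:00:03,000\nYou are at home\n"
    print(f"similarity: {similarity(machine, reference):.2f}")
```

A low score doesn't say which side is wrong, and two good translations of the same line can share few words, so this is only a rough first filter before a human proofread.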

Re: Re: Re: Re: Re: Re: Machine captioning SRT testing

javabeanies

2021-07-04 18:07:28

Yeah, will test that direction out. That said, the biggest issue with Japanese to English is really the language construct: SOV vs SVO structure. For Spanish, Italian etc., accuracy is definitely better.

Re: Re: Re: Re: Re: Re: Re: Machine captioning SRT testing

truc1979

2021-07-05 10:27:40

> language construct SOV vs SVO structure.

Yes, but not only. Speech recognition doesn't mark silence, punctuation, etc., and Japanese in its written form doesn't use spaces between words. Thus, you have to check where each line of dialog ends. Look at index 5 and 6 from your example:

裏のお爺ちゃんのご飯作りに行ってきなよなんで私がいいじゃないどうせヒマでしょあのお爺ちゃんも年だしえきたまに様子見が

I'm gonna go to make rice for my grandfather on the back
It's not free anyway, that old man is also old, and sometimes he has a wait-and-see

Now, from what I understand, if I "split" the sentences like this:

...裏の - お爺ちゃんのご飯作りに行ってきなよ - なんで? - 私がいいじゃない - どうせヒマでしょあのお爺ちゃんも年だしえきたまに様子見が...

Then the translation becomes:

...on the back. - Don't go to make your grandfather's rice - Why? - I don't like him - Anyway, that old man is also old, and sometimes he waits...

So as you can see, even if the speech recognition is perfect (but it's never perfect...), the proofreading is very important and takes a lot of time...

From which movie did you take that sample?
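One way to at least surface these cases before translating: in the sample, cue 5 ends at exactly the time cue 6 starts, so the recognizer probably cut one continuous stretch of speech in two. Below is a rough sketch that glues such "touching" cues back together so the whole run can be split by hand (or by a tool) before translation. The `Cue` type, the `merge_touching` name and the `max_gap` threshold are my own assumptions, and I'm reading a timestamp like `00:01:04,62` as 64.62 seconds, which may not be what the tool meant.

```python
from typing import List, NamedTuple


class Cue(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    text: str


def merge_touching(cues: List[Cue], max_gap: float = 0.05, sep: str = "") -> List[Cue]:
    """Merge consecutive cues whose gap is at most max_gap seconds.

    Cues are expected in chronological order. sep is the string put
    between merged texts: "" suits Japanese, " " suits English.
    """
    merged: List[Cue] = []
    for cue in cues:
        if merged and cue.start - merged[-1].end <= max_gap:
            prev = merged[-1]
            # Extend the previous cue instead of starting a new one.
            merged[-1] = Cue(prev.start, cue.end, prev.text + sep + cue.text)
        else:
            merged.append(cue)
    return merged


if __name__ == "__main__":
    # Cues 5 and 6 from the sample, with times read as seconds (assumption).
    cues = [
        Cue(64.62, 71.77, "裏のお爺ちゃんのご飯作りに行ってきなよなんで私がいい"),
        Cue(71.77, 78.15, "じゃないどうせヒマでしょあのお爺ちゃんも年だしえきたまに様子見が"),
    ]
    for cue in merge_touching(cues):
        print(cue.start, cue.end, cue.text)
```

This doesn't find the sentence boundaries by itself; it only hands the proofreader one unbroken run instead of two half-sentences, which is where the "なんで? / 私がいいじゃない" split becomes visible.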

Re: Re: Re: Re: Re: Re: Re: Re: Machine captioning SRT testing

javabeanies

2021-07-05 10:45:39

That's an interesting perspective. So far I am able to do speaker identification. However, once it's pushed to SRT, that separation of speakers disappears. I am using this JAV as a reference to get a sense of the accuracy of my transcription and translation: https://www.avsubtitles.com//movie.php?movid=2214

Re: Re: Re: Re: Re: Re: Re: Re: Re: Machine captioning SRT testing

javabeanies

2021-07-05 11:24:56

Using Nima-007, there are 2 ways to process the transcribed file:
1. Via timeline. This approach loses the speaker identification.
2. Via speaker identification. This approach loses the timeline identification but results in a more understandable translation.
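A possible middle ground, sketched below and not yet part of my pipeline: keep the timeline order but write the speaker label into the cue text itself, since SRT has no speaker field of its own. This assumes the transcription step can give one record per utterance with a start time, end time, speaker label and text; the `Segment` type and the `[A]` label style are made up for illustration.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start: float   # seconds
    end: float     # seconds
    speaker: str   # label from speaker identification, e.g. "A"
    text: str


def fmt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(segments: List[Segment]) -> str:
    """Build SRT text ordered by start time, with the speaker in each cue."""
    blocks = []
    ordered = sorted(segments, key=lambda seg: seg.start)
    for index, seg in enumerate(ordered, start=1):
        blocks.append(
            f"{index}\n"
            f"{fmt_time(seg.start)} --> {fmt_time(seg.end)}\n"
            f"[{seg.speaker}] {seg.text}\n"
        )
    # Each block ends with a newline, so joining with "\n" leaves
    # one blank line between cues.
    return "\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical per-speaker output, deliberately out of time order.
    segments = [
        Segment(2.0, 3.0, "B", "Why?"),
        Segment(0.0, 1.5, "A", "Go and make his dinner."),
    ]
    print(to_srt(segments))
```

The labels can be stripped again after proofreading if they shouldn't show on screen; the point is only that the translator sees who is speaking while the timing is preserved.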