- February 28, 2025 at 6:21 am #30181
WmMitchell
Participant
I have a several-step process for cleaning up transcripts I save from YouTube how-to videos. (I save these as it’s always so much easier to follow along if I have the instructions in text format I can consult.) However, my regexp ability is quite limited and my workarounds here don’t always yield the best results, so I waste time cleaning up the results each time.
I’ll try to outline … Here is an example of part of a transcript of the kind I like to save:
INTRO
0:01
Oil is filled into the engine through the filler neck,
0:08
which is located in the valve cover of the engine.
0:20
From the head, the oil is drained through the return channels, through the engine block, into the sump.
PART 1
0:52
The dipstick checks the amount of oil filled.
0:56
The oil level should not go beyond the lower and upper marks.
1:25
The helical gear of the crankshaft drives the oil pump and distributor.
1:31
Oil from the sump, through the intake tube, is pumped into the oil pump.
1:42
The oil pump outlet is divided into two channels.
PART 2
1:52
One channel leads to the oil filter. And the second channel leads to the pressure reducing
1:58
valve of the pump. If the oil pressure at the outlet of the pump exceeds 6 kilograms per centimeter,
I use a couple of macros I made that put each timestamp and its text on the same line and add a tab between the time and the text. But the macro approach is cumbersome.
How could one do the following:
1. Remove the hard return after each timestamp and put a tab in its place (I already do that, but it doesn’t handle timestamps above 9m59s)
– so
0:01
Oil is filled into the engine through the filler neck,
becomes (with 2 tabs after because it’s under 10m):
0:01{TAB}{TAB}Oil is filled into the engine through the filler neck,
and the rest, i.e. …
0:08{TAB}{TAB}which is located in the valve cover of the engine.
0:20{TAB}{TAB}From the head, the oil is drained through the return channels, through the engine block, into the sump.
****BUT**** for all lines 10m and over, to add only ONE tab, i.e.:
10:09{TAB}It is installed in the second groove, on the valve stem. Valve seals,
10:15{TAB}in this engine, are not provided from the factory.
10:40{TAB}The valve disc washer limits the amount of oil that gets on the springs.
10:53{TAB}The sealing rubber ring blocks the direct penetration of oil, through crackers,
I know this is unusual, but the app that I always paste this text into has built-in tab stops I cannot edit, so the space between the timestamp and the text for everything under 10m is very, very small and makes the text quite difficult to read. Using timestamps and text from downloaded subtitle files (SRT files) is useless, as the subtitle format isn’t easy to read. So editing the transcripts is the best approach, and it is the one I’ve used for several years now.
The 2nd issue is that there are often “subtitle” breaks in the transcript (I added INTRO, PART 1 and PART 2 in the transcript above as an example).
Ideally, there would be 3 hard returns before each of these snippets of text and 2 hard returns after the subtitle text.
I believe both things (tabs after timestamps, and the number of hard returns before and after the added “subtitles”) can be done via regex, but I just haven’t had any luck beyond what I’ve done in a macro that is tedious and doesn’t produce reliable results 100% of the time.
Thank you for any help in this regard!
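The second request (spacing around the added headings) can be sketched in plain JavaScript. This is only a rough illustration, not EmEditor macro code, and it rests on several assumptions: the timestamps have already been joined to their text with a tab, any line that does not start with a timestamp is a heading, and "3 hard returns before, 2 after" means three newline characters before the heading and two after it. The name `spaceHeadings` is made up for this example.

```javascript
// Hypothetical helper: adds spacing around heading lines such as INTRO or PART 1.
// Assumes every caption line already looks like "0:01<TAB>...text".
function spaceHeadings(text) {
  // m:ss, mm:ss or h:mm:ss followed by a tab marks a caption line.
  const isCaption = /^\d{1,2}:\d\d(?::\d\d)?\t/;
  const out = [];
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue; // drop old blank lines so spacing is not doubled
    if (isCaption.test(line)) {
      out.push(line);
    } else {
      // Heading: two blank lines before it (three newlines), unless it opens the text...
      if (out.length > 0) out.push("", "");
      // ...and one blank line after it (two newlines).
      out.push(line, "");
    }
  }
  return out.join("\n");
}

const joined = "INTRO\n0:01\t\tOil is filled into the engine\nPART 1\n0:52\t\tThe dipstick checks the amount of oil filled.";
console.log(spaceHeadings(joined));
```

A caption that wraps onto a second line without its own timestamp would be mistaken for a heading here, so the check would need tightening for transcripts like that.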
February 28, 2025 at 8:42 am #30182
Patrick C
Participant
First, a suggestion:
Consider replacing the timestamps below 10 minutes:
0:xx
with
00:xx
This will make the number of tabs consistent across all times below 100 minutes.
The regex for that is:
Find: ^0:
Replace: 00:
Then replace the newlines after the timestamps with a tab.
Find: (?<=^\d\d:\d\d)\n
Replace: \t
Should you need assistance on what the regex above does:
https://regex101.com => paste the regex and a transcript.
Hope this helps.
Patrick
Remarks:
1) I’m assuming that you have no leading or trailing whitespace.
2) Find/Replace Options → Advanced button: Boost engine with all options unchecked, additional lines is set to 0.
February 28, 2025 at 9:10 am #30183
Patrick C
Participant
Eeek, noticed a mistake.
The following is wrong.
Find: ^0:
Replace: 00:
Instead:
Find: ^(?=\d:)
Replace: 0
Sorry.
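To see why the correction matters: `^0:` only pads stamps that begin with a literal 0, so 3:14 stays as it is, while `^(?=\d:)` inserts a 0 in front of any single-digit minute. Below is a minimal sketch of both steps in plain JavaScript rather than EmEditor macro code. The function names are invented for illustration, and it assumes a JavaScript engine with lookbehind support (for example a recent Node.js).

```javascript
// Hypothetical helpers mirroring the two Find/Replace steps above.

// Step 1: insert "0" at the start of any line that begins with one digit and a colon.
function padMinutes(text) {
  return text.replace(/^(?=\d:)/gm, "0");
}

// Step 2: replace the newline that follows a line consisting of exactly mm:ss with a tab.
function joinTimestamps(text) {
  return text.replace(/(?<=^\d\d:\d\d)\n/gm, "\t");
}

const sample = "0:24\norganizing transformations thank you\n3:14\nthey tend to pile till later\n14:20\nyou're not using it one time a week";
console.log(joinTimestamps(padMinutes(sample)));
```

Note that a three-part stamp such as 1:02:46 also starts with a single digit and a colon, so step 1 turns it into 01:02:46, and step 2 then leaves it unjoined because the line is not exactly mm:ss.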
March 1, 2025 at 5:21 am #30186
WmMitchell
Participant
Thank you! The thread responses just brought it home to me that I should handle this one thing at a time, as it is quite tricky and I’m not that good with regexp.
Yikes, the comment above regarding the “100 minutes” reminded me that, though rare, I do save transcripts for longer videos (conferences and the odd long tutorial) that run well over 100m, at 2 hours and more!
So the first code (with or without the ^ symbol):
// Pad timestamps that start with "0:" to "00:"
document.selection.StartOfDocument(false);
document.selection.Replace("^0:", "00:", eeFindNext | eeReplaceAll | eeFindReplaceRegExp);
document.selection.StartOfDocument(false);
produced this type of listing (after I’d run the tabs macro):
00:24 organizing transformations thank you
00:35 [Music] thank you
3:14 they tend to pile till later and they leave out all their favorite things they're always on the go and my client
14:20 you're not using it one time a week it's going to be real you have a ton of storage you don't know I don't feel like
38:48 really maximize the vertical space so I got a lot of different sizes I'm lining them up in the driveway so I can sort
1:02:46 letting go was all they needed to make a really big change
And the “corrected” code of:
// Insert "0" before any line that starts with a single digit and a colon
document.selection.StartOfDocument(false);
document.selection.Replace("^(?=\\d:)", "0", eeFindNext | eeReplaceAll | eeFindReplaceRegExp);
document.selection.StartOfDocument(false);
yielded this:
00:24 organizing transformations thank you
00:35 [Music] thank you
03:14 they tend to pile till later and they leave out all their favorite things they're always on the go and my client
14:20 you're not using it one time a week it's going to be real you have a ton of storage you don't know I don't feel like
38:48 really maximize the vertical space so I got a lot of different sizes I'm lining them up in the driveway so I can sort
01:02:46 letting go was all they needed to make a really big change
So that is good for all timestamps starting with a zero, but it doesn’t cover the ones beginning with 1-5 (1m-59m) and, I imagine, 2 or 3.
If we can take care of this time format issue first, the rest should be easier (hoping).
Example timestamps:
0:24 organizing transformations thank you
0:35 [Music] thank you
3:14 they tend to pile till later and they leave out all their favorite things they're always on the go and my client
14:20 you're not using it one time a week it's going to be real you have a ton of storage you don't know I don't feel like
38:48 really maximize the vertical space so I got a lot of different sizes I'm lining them up in the driveway so I can sort
1:02:46 letting go was all they needed to make a really big change
2:03:33 ...
3:09:12 ...
Thank you!!
March 3, 2025 at 1:13 am #30187
Patrick C
Participant
Perhaps the following approach is easier for you:
Find: ^(\d:\d\d)\n
selects x:xx located at the beginning of the line (^)
Replace: \1\t\t
Re-pastes x:xx (\1) and adds two tabs (\t)
OR: 0\1\t
Pastes 0 followed by x:xx (\1) and adds one tab (\t).
Find: ^(\d\d:\d\d)\n
selects xx:xx located at the beginning of the line (^)
Replace: \1\t
Re-pastes xx:xx (\1) and adds one tab (\t)
Find: ^(\d:\d\d:\d\d)\n
selects x:xx:xx located at the beginning of the line (^)
Replace: \1\t
Re-pastes x:xx:xx (\1) and adds one tab (\t)
Adapt according to your needs.
To replace all expressions in one single go:
Use EmEditor’s batch replace feature (Find / Replace dialogue box → Batch >>).
Batch replace also allows saving and loading your find/replace definitions (import / export).
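As a rough picture of what running the three definitions as one batch amounts to, here is a sketch in plain JavaScript. It is not EmEditor's batch feature itself; `rules` and `batchReplace` are names made up for this example. JavaScript writes the captured group as `$1` where the dialog uses `\1`.

```javascript
// Hypothetical rule list: the three Find/Replace pairs above, applied in order.
const rules = [
  { find: /^(\d:\d\d)\n/gm, replace: "$1\t\t" },    // x:xx    -> two tabs
  { find: /^(\d\d:\d\d)\n/gm, replace: "$1\t" },    // xx:xx   -> one tab
  { find: /^(\d:\d\d:\d\d)\n/gm, replace: "$1\t" }, // x:xx:xx -> one tab
];

// Run every rule over the whole text, feeding each result into the next rule.
function batchReplace(text, ruleList) {
  return ruleList.reduce((acc, rule) => acc.replace(rule.find, rule.replace), text);
}

const transcript = "0:24\norganizing transformations thank you\n14:20\nyou're not using it one time a week\n1:02:46\nletting go was all they needed";
console.log(batchReplace(transcript, rules));
```

Each pattern is anchored at the start of the line and ends at the newline, so the three rules cannot match each other's timestamps and their order does not matter here.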