RUBY Encoding format

For general discussion related to FlowStone
Drnkhobo
Posts: 312
Joined: Sun Aug 19, 2012 7:13 pm
Location: ZA

RUBY Encoding format

Post by Drnkhobo »

Hey guys, I'm trying to parse some text files for email addresses and I get an error within Ruby:

Encoding: (in method 'event')::ConverterNotFoundError: code converter not found (ASCII-8BIT to UTF-8)

Now, I can parse emails one by one, which is fine, but as soon as I try to do a whole directory it gives this error. I am going through each file and adding the result to an array. . . take a look:

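Code:

@db = []

Dir.foreach('C:\S') do |item|
  # Skip the current- and parent-directory entries
  next if item == '.' or item == '..'
  newfile = 'C:/S/' << item

  # Read the whole file, then release the handle
  f = File.open(newfile)
  content = f.read
  f.close

  # Collect the unique email addresses found in this file
  r = /\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b/
  emails = content.scan(r).uniq

  # Keep the last address found and output the list so far
  addy = emails[-1]
  @db << addy
  output @db
end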


So as I said, it works fine until I try to parse a lot of files. It gets through about 15 or so before raising the error. I have just exported my emails from Thunderbird, so they are .eml files.

It's weird because it goes through like 15 before spitting out the error. :?

I have checked online and tried a few solutions to no avail. Does anyone here have experience with the Encoding functions? Why would it give an error only after getting through a few files?
tulamide
Posts: 2714
Joined: Sat Jun 21, 2014 2:48 pm
Location: Germany

Re: RUBY Encoding format

Post by tulamide »

You said you tried something from the web without specifying what exactly, so excuse me if this is not helpful.

This tells Ruby to interpret the content as UTF-8 without touching the byte sequence read from the file:

Code:

content = f.read.force_encoding("utf-8")
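
A minimal sketch of what that line does, assuming the bytes in the file really are UTF-8: force_encoding only changes the label on the string, so no converter is involved. The helper name relabel_utf8 is made up for illustration, and valid_encoding? is used to check whether the relabelled bytes actually form valid UTF-8.

```ruby
# Hypothetical helper: relabel a string as UTF-8 without touching its bytes.
# It works on a copy, so the caller's string keeps its original label.
def relabel_utf8(raw)
  raw.dup.force_encoding("UTF-8")
end

# Bytes C3 A9 are the UTF-8 encoding of "é"; start with them labelled as binary.
raw  = "caf\xC3\xA9".dup.force_encoding("ASCII-8BIT")
text = relabel_utf8(raw)

same_bytes = (text.bytes.to_a == raw.bytes.to_a) # true: only the label differs
utf8_ok    = text.valid_encoding?                # true: the bytes are valid UTF-8

# A lone E9 byte (Latin-1 "é") is not valid UTF-8. Relabelling still succeeds,
# but the result is flagged as invalid and may raise later on.
bad     = relabel_utf8("caf\xE9".dup.force_encoding("ASCII-8BIT"))
bad_ok  = bad.valid_encoding?                    # false
```

If valid_encoding? comes back false for a file, the bytes are probably in some other encoding (or not text at all), and relabelling alone will not make them readable.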


Another possibility is that the byte sequence is being read in wrong. This does a conversion first, then again tells Ruby to handle the result as UTF-8:

Code:

content = f.read.encode("iso-8859-1").force_encoding("utf-8")


Lastly, a double conversion might help:

Code:

content = f.read.encode("iso-8859-1").encode("utf-8")



ASCII-8bit shows that somehow the data is interpreted as binary instead of string. But I don't know why.
"There lies the dog buried" (German saying translated literally)
Drnkhobo
Posts: 312
Joined: Sun Aug 19, 2012 7:13 pm
Location: ZA

Re: RUBY Encoding format

Post by Drnkhobo »

Thanks Tulamide :D

I will give it a go now now. I did try to force the encoding, but your solution looks like it might just work for me (how do I know :roll: :lol: )

It's weird that if I put 10 emails in my folder it does its job fine without complaining. As soon as it's more than that, the error pops up. Strange. . .
tulamide
Posts: 2714
Joined: Sat Jun 21, 2014 2:48 pm
Location: Germany

Re: RUBY Encoding format

Post by tulamide »

Ok, I thought about the weird behavior. Don't laugh: are you sure the folder contains nothing but human-readable text files? Maybe there's a hidden system file among them, or some other binary?
I ask because your code does not protect against reading those in.

To be absolutely sure, you could extend the if statement with

Code:

next if item == '.' or item == '..' or item[-3, 3].casecmp("eml") != 0
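
A possible variation on the check above, sketched with a hypothetical helper eml_file?. The reason for it: item[-3, 3] returns nil for a name shorter than three characters, so casecmp would raise on such a file, while File.extname simply returns an empty string there.

```ruby
# Hypothetical helper: true only for names ending in ".eml" (any letter case).
# File.extname returns "" for ".", ".." and names without an extension,
# so those are rejected without a separate check.
def eml_file?(name)
  File.extname(name).casecmp(".eml") == 0
end

names = [".", "..", "a", "mail.eml", "MAIL.EML", "Thumbs.db", "notes.eml.bak"]
eml_names = names.select { |n| eml_file?(n) }
# eml_names now holds only "mail.eml" and "MAIL.EML"
```

Inside the loop this would read: next unless eml_file?(item)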
"There lies the dog buried" (German saying translated literally)
trogluddite
Posts: 1730
Joined: Fri Oct 22, 2010 12:46 am
Location: Yorkshire, UK

Re: RUBY Encoding format

Post by trogluddite »

tulamide wrote: ASCII-8bit shows that somehow the data is interpreted as binary instead of string. But I don't know why.

This might give you a clue...

Code:

Encoding.name_list
#=> ["ASCII-8BIT", "UTF-8", "US-ASCII"]

If I do the same thing on my standard Windows Ruby 1.9.3 install, I get 99 different available encodings!
So, the FS version of Ruby has had huge amounts of the Encoding class ripped out. I can only guess that this is partly to enforce compatibility with the "green" strings, which are always ASCII, and maybe just to reduce the Ruby interpreter size for embedding into exports. It is very annoying - I had hoped that Ruby could be used to allow proper Unicode support, but it seems not!
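
To check this from code before attempting a conversion, something like the following might work. It is only a sketch: encoding_available? and to_utf8 are hypothetical helper names, ISO-8859-1 as the source encoding is an assumption about the files, and the fallback simply relabels the bytes with force_encoding instead of converting them.

```ruby
# Hypothetical helper: does this Ruby build know the named encoding?
# Encoding.find raises ArgumentError for an unknown name.
def encoding_available?(name)
  Encoding.find(name)
  true
rescue ArgumentError
  false
end

# Hypothetical helper: convert from `source` to UTF-8 when the build supports it,
# otherwise just relabel the bytes as UTF-8 without converting them.
def to_utf8(raw, source = "ISO-8859-1")
  return raw.dup.force_encoding("UTF-8") unless encoding_available?(source)
  raw.dup.force_encoding(source).encode("UTF-8")
rescue Encoding::ConverterNotFoundError
  # The encoding name is known, but this build has no converter to UTF-8.
  raw.dup.force_encoding("UTF-8")
end

raw       = "caf\xE9".dup.force_encoding("ASCII-8BIT") # E9 is "é" in ISO-8859-1
converted = to_utf8(raw)                    # transcoded where ISO-8859-1 exists
fallback  = to_utf8(raw, "NO-SUCH-ENCODING") # bytes unchanged, label set to UTF-8
```

With only the three encodings listed above, the first branch would always be taken, so in effect every file gets relabelled rather than converted.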
All schematics/modules I post are free for all to use - but a credit is always polite!
Don't stagnate, mutate to create!
tulamide
Posts: 2714
Joined: Sat Jun 21, 2014 2:48 pm
Location: Germany

Re: RUBY Encoding format

Post by tulamide »

trogluddite wrote: This might give you a clue...

It does :mrgreen:
Tbh, I didn't even think of checking it. I just assumed a fully functional Ruby. :cry:

@Drnkhobo
Ignore the last two "content" examples. They won't work under these circumstances. And let me know if one of the other tips helped you.
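
One way to act on the remaining tips without any conversion, as a rough sketch that has not been tested inside FlowStone: read each file as raw bytes, scan with the ASCII-only pattern, and relabel only the matches. The names EMAIL_PATTERN and last_addresses are hypothetical, and whether this avoids the error depends on where FlowStone attempts the conversion, which has not been established here.

```ruby
# An ASCII-only pattern can be matched against binary (ASCII-8BIT) strings,
# so no transcoding is needed to find the addresses.
EMAIL_PATTERN = /\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b/

# Hypothetical helper: collect the last unique address found in each .eml file.
def last_addresses(dir)
  db = []
  Dir.foreach(dir) do |item|
    # Skips ".", ".." and anything that is not an .eml file.
    next unless File.extname(item).casecmp(".eml") == 0

    # "rb" returns the raw bytes; the block form closes the file afterwards.
    raw = File.open(File.join(dir, item), "rb") { |f| f.read }

    addy = raw.scan(EMAIL_PATTERN).uniq.last
    # A match contains only ASCII characters, so relabelling it is safe.
    db << addy.force_encoding("UTF-8") unless addy.nil?
  end
  db
end
```

In the original loop this would be used as @db = last_addresses('C:/S') followed by a single output @db. Files with no address are skipped instead of adding nil to the array.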
"There lies the dog buried" (German saying translated literally)
tester
Posts: 1786
Joined: Wed Jan 18, 2012 10:52 pm
Location: Poland, internet

Re: RUBY Encoding format

Post by tester »

I spoke to Malc about the problem with international characters (edit boxes in edit state, and the difference between the string prim and the text prim) because I can't display some things properly in my language, and he said he will check it. Since FS is expanding into less audio-oriented areas, I think this issue may land on the priority list.
Need to take a break? I have something right for you.
Feel free to donate. Thank you for your contribution.
trogluddite
Posts: 1730
Joined: Fri Oct 22, 2010 12:46 am
Location: Yorkshire, UK

Re: RUBY Encoding format

Post by trogluddite »

Would be a valuable thing for users of any language - it would open the doors for working with many kinds of external documents, and most of the Windows API assumes multi-byte character encodings as standard. For example, I have had to use some really ugly hacks to make the DLLs I'm working on call Windows functions successfully when they have string arguments.
All schematics/modules I post are free for all to use - but a credit is always polite!
Don't stagnate, mutate to create!
JB_AU
Posts: 171
Joined: Tue May 21, 2013 11:01 pm

Re: RUBY Encoding format

Post by JB_AU »

Don't forget attachments: .eml files can contain binary attachments.
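
If the addresses of interest are in the headers, one option is to scan only that part, since attachments sit in the message body. In an .eml file the header block ends at the first empty line. The helper header_block below is hypothetical, and limiting the scan to the headers is an assumption about what is needed here.

```ruby
# Hypothetical helper: return only the header block of a raw message.
# Headers end at the first empty line (CRLF CRLF, or LF LF in some exports).
# The pattern is ASCII-only, so it also works on binary (ASCII-8BIT) strings.
def header_block(raw)
  raw.split(/\r?\n\r?\n/, 2).first.to_s
end

message = "From: a@example.com\r\nTo: b@example.org\r\n\r\nbody text\r\n\r\nmore body"
headers = header_block(message)
# headers holds the two header lines; everything after the first blank line is dropped
```

Addresses that appear only in the body are missed with this approach, so it fits best when the sender or recipient is what you are after.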
"Two things are infinite: the universe and human stupidity; and I'm not sure about the the universe."

Albert Einstein
JB_AU
Posts: 171
Joined: Tue May 21, 2013 11:01 pm

Re: RUBY Encoding format

Post by JB_AU »

I'm not 100% sure what sequence of events you need.

It's possible to set the Preference file to view as HTML or plain text, which is one less conversion step.
Retrieve mail, parse tags, dump it, do stuff.

For the long-range drone (50 km away), I parse SMS as plain text.
Parse the sender tags for a specific sender ID.
Then set a predefined condition by subject tag.

I'm not using Ruby that I know of!

Hope this helps?
"Two things are infinite: the universe and human stupidity; and I'm not sure about the the universe."

Albert Einstein