Discussion:
Input data as strings? I am a BioInformatics research student.
Nathan TeGrotenhuis
2009-11-06 09:46:08 UTC
Hello, I am a student interested in using FANN for a bioinformatics
project. I have compiled FANN and it seems to be working fine. Please
correct me if I am wrong, but it seems to me that strings are not
acceptable input data. There doesn't seem to be anything in the
documentation about acceptable input data types, but it seems like
only numbers will work. I think this project is great, and I am
willing to contribute code to help it work with string input, unless I
am just ignorant and the neural networks created by FANN are
inherently incapable of learning properties of strings.

Anyway, for debugging purposes, I am including my training data. The
training program is the same as the one in the XOR tutorial, except
that I changed num_input to 1. The idea is for the network to learn to
recognize sentences containing "God".

7 1 1
since_calcGodulating_this_will_require_to_go_through_the_entire_training_set_once_more,_it_is_more_than_adequate_to_use_this_value_during_tr
1
A_US_Army_major_has_opened_fire_on_fellow_soldiers_at_the_Fort_Hood_military_base_in_Texas,_killing_13_people_and_injuring_30,_officials_say
-1
The_United_States_imposes_high_anti-dumping_tariffs_God_on_Chinese_pipes_as_trade_disputes_mar_the_run-up_to_a_bilateral_summit.
1
GodCambodia_recalls_its_ambassador_from_Thailand_in_tit-for-tat_dispute_over_sanctuary_offer_to_former_Thai_PM_Thaksin.
1
A_gunman_in_Japan_has_killed_himself_after_wounding_three_people_in_Yokohama,_outside_Tokyo,_police_say.
-1
Police_named_the_gunman_as_Kenji_Hayashi,_a_62-year-old_member_of_the_Inagawa-kai,_a_largeGod_Japanese_organised_crime_group._
1
An_electric_car_created_by_ex-McLaren_Formula_One_designer_Gordon_Murray_has_been_unveiled.
-1

--
Nate
Fernando Jiménez Solano
2009-11-06 10:12:27 UTC
Hello.
Post by Nathan TeGrotenhuis
The idea is for the network to learn to
recognize sentences containing "God".
You don't want to use ANN for that, you want regular expressions.

http://en.wikipedia.org/wiki/Regular_expression

Regards.
Nathan TeGrotenhuis
2009-11-06 18:15:22 UTC
Thank you for your response,

I am aware that regular expressions are the correct tool for finding
substrings. The reason for this program is to see whether FANN is able to find
patterns in strings. The goal is to classify peptide sequences
according to a particular property of the enzyme for which the sequence codes.

The training set is sequences for enzymes that are known to be either
thermophilic or non-thermophilic. Hopefully, the ANN will learn to recognize
whether or not the sequence codes for a thermophilic enzyme. The
experiment set is sequences for which the property is not known.

So you see that I really do want to use FANN; the finding-"God" problem is
just a simple experiment for me to learn to use the network.

I have tried mapping each character to an integer, but now the number of
input nodes is not constant. Is it possible to create a network with a
variable number of input nodes?

good day.
---nate
_______________________________________________
Fann-general mailing list
https://lists.sourceforge.net/lists/listinfo/fann-general
--
Nate
M.Ranji
2009-11-06 18:28:07 UTC
Assuming this is one of your raw sequences (Emiliania huxleyi in this case):
ggtccggtcggattccgggatatcgtcgacccacgcgtccgctagttctagatcgcgagcggccgcccttttttttttttttttctcgggcccgggtcggctcaggagagccccccggacagccgcgcgctccacgcgaacgcggagcccgcgacggggttagacggggtacggtgcaacatcggtgtgggttggaaagaccggtaatgatccttccgcaggttcacctacggaaaccttgttacgacttctccttcctctaaatgataaggttcggacagcttcccgcggcgtcgcggctggagaaccagctgcggcgccgcagtccgggggcctcaccggatcattcaatcggtaggagcgacgggcggtgtgtacaaagggcagggacgtaatcaacgtgcgctgatgacacacgcttactaggaattcctcgttgaagattaatagttgcaataatctatccccatcacgatgcaatttcaaaagattacccggacctctcggtcaaggtgatagactcgttgagtgcatcagtgtagcgcgcgtgcggcccagaacatctaagggcatcacagacctgttattgccgcgaacttccacttgttgaagacaagttgtccctctaagaagctccagcgaacggagggttcgcgtcgctatttagcaggctgcggtctcgttcgttaacggaattaaccagacaaatcactccaccaactaagaacggccatgcaccaccacccatcgaatcaagaaagagctctcaatctgtcaatcctcacaatgtctggacctggtaagttttcccgggttgagtcaaattaagccgcaggctccactcctggtggtgcccttccgtcaatccctttagtttcagccttgcgaccatactccccccggaacccaaagactttagtttcccgaaaggtgctgaaggagcccaaatgggaacatcctccaatcc
tagtcggc

Have you tried using the entire sequence as one input, along with "Direction" and other properties of the sequence? You shouldn't try to map each character to its own input node, if that's what you are doing.

- Mohammad


Nathan TeGrotenhuis
2009-11-06 23:33:54 UTC
So far, all I have done is the find-"God" problem. My project works
with amino acid sequences rather than with genetic sequences, but that
difference should not matter to the neural network. The problem is that when
I run the training program, I get this error:
FANN Error 10: Error reading info from train data file "God.data", line: 2.

The God.data file contains:

8 1 1
ThisfunctionreturnstheMSEerrorasitiscalculatedeitherbeforeorduringtheactualtraining.ThisisnottheactualMSEafterthetrainingepoch,but
-1
sincecalcGodulatingthiswillrequiretogothroughtheentiretrainingsetoncemore,itismorethanadequatetousethisvalueduringtraining.
1
AUSArmymajorhasopenedfireonfellowsoldiersattheFortHoodmilitarybaseinTexas,killing13peopleandinjuring30,officialssay.
-1
TheUnitedStatesimposeshighanti-dumpingtariffsGodonChinesepipesastradedisputesmartherun-uptoabilateralsummit.
1
GodCambodiarecallsitsambassadorfromThailandintit-for-tatdisputeoversanctuaryoffertoformerThaiPMThaksin.
1
AgunmaninJapanhaskilledhimselfafterwoundingthreepeopleinYokohama,outsideTokyo,policesay.
-1
PolicenamedthegunmanasKenjiHayashi,a62-year-oldmemberoftheInagawa-kai,alargeGodJapaneseorganisedcrimegroup.
1
Anelectriccarcreatedbyex-McLarenFormulaOnedesignerGordonMurrayhasbeenunveiled.
-1

As far as I can tell, the problem is that the input data cannot be
characters. Is this the case?
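
A minimal sketch of the check that seems to fail here, assuming (as the error message suggests) that every whitespace-separated value in the training file must parse as a number. The helper name first_bad_token is hypothetical and is not part of FANN; it only mimics the idea in Python so a data file can be checked before training.

```python
def first_bad_token(text):
    """Return (line_number, token) for the first non-numeric token, or None.

    Hypothetical helper, not part of FANN: it assumes every
    whitespace-separated token in a training file must be a number.
    """
    for line_number, line in enumerate(text.splitlines(), start=1):
        for token in line.split():
            try:
                float(token)
            except ValueError:
                return line_number, token
    return None


# Two shortened pairs in the same layout as God.data.
sample = "2 1 1\nGodCambodia\n1\nAgunman\n-1\n"
print(first_bad_token(sample))  # the first offending token is on line 2
```

With a file laid out like God.data, the first offending token is the text on line 2, which matches the line number in the error above.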
--
Nate
Nathan TeGrotenhuis
2009-11-06 23:37:51 UTC
Although it looks like some strings span more than one line, I have
made sure that they do not.

--
Nate
Everardo Robredo
2009-11-08 09:45:49 UTC
I'm not sure that giving raw text, or even a numeric representation of a
chain, as an input is a good idea... How long is the longest sequence
(in amino acids)?

When you create the training file, you have to leave a space between each
input if you want the ANN to check letter by letter. Also, you have to use
numbers, but that shouldn't be an issue.
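
As a rough sketch of that advice, the snippet below turns a string into one number per letter and joins the numbers with spaces. The function names and the scaling (code point divided by 255, which assumes plain ASCII text) are illustrative choices, not something FANN prescribes.

```python
def encode_chars(text):
    """Map each character to a number via its code point.

    Assumed scaling: code point / 255, so ASCII text lands in [0, 1].
    """
    return [ord(ch) / 255.0 for ch in text]


def to_input_line(values):
    """Join the numbers with single spaces, one value per input node."""
    return " ".join(f"{v:.4f}" for v in values)


line = to_input_line(encode_chars("God"))
print(line)  # three space-separated numbers, one per letter
```

Each letter becomes one input value, so a string of n letters needs a network with n input nodes.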

The real problem comes from the number of inputs you would have to give
(one for each letter in the chain) and the fact that you have to specify the
number of inputs at the beginning of the training file. If you have protein
chains of different lengths as inputs, you are going to be forced to pad the
shorter training patterns with something.
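
A minimal sketch of that padding step, assuming the header layout used in the files earlier in this thread (number of pairs, number of inputs, number of outputs). The names PAD, encode and build_training_file are hypothetical, and 0.0 as the filler value is only one possible choice.

```python
PAD = 0.0  # assumed filler value; real letters are coded as values above 0


def encode(seq, alphabet):
    """Number the letters 1..len(alphabet) and scale them into (0, 1].

    Raises ValueError for a letter that is not in the alphabet.
    """
    return [(alphabet.index(ch) + 1) / len(alphabet) for ch in seq]


def build_training_file(samples, alphabet):
    """samples: list of (sequence, target). Returns the training file text."""
    width = max(len(seq) for seq, _ in samples)  # longest chain sets num_input
    lines = [f"{len(samples)} {width} 1"]        # header: pairs, inputs, outputs
    for seq, target in samples:
        values = encode(seq, alphabet)
        values += [PAD] * (width - len(values))  # pad the shorter chains
        lines.append(" ".join(f"{v:g}" for v in values))
        lines.append(f"{target:g}")
    return "\n".join(lines) + "\n"


text = build_training_file([("ACD", 1), ("AC", -1)], "ACD")
print(text)
```

Every input line ends up with the same number of values as the longest chain, so the count in the header is valid for all patterns.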

Have you considered working with thermodynamic properties of the amino acids
instead of the amino acid identity itself? That would reduce the number of
inputs, although the length difference issue would remain.
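
One way to picture this, as a sketch only: replace each residue by a single property value. The thread does not name a property, so the Kyte-Doolittle hydropathy index (values as commonly tabulated) is used here purely as an example, and property_profile is a hypothetical helper name.

```python
# Kyte-Doolittle hydropathy index, chosen only as an example property;
# the thread does not say which property to use.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def property_profile(seq):
    """Return one hydropathy value per residue (one-letter codes)."""
    return [HYDROPATHY[aa] for aa in seq.upper()]


print(property_profile("ail"))
```

The profile still has one value per residue, so chains of different lengths still need padding to a common number of inputs.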
--
Everardo
Everardo Robredo
2009-11-19 08:06:04 UTC
Hi,

I want to use SOM for pattern recognition, and I was wondering when
version 2.2 is going to be released. I hope it doesn't take long...
--
Everardo