glossed over in most papers and research: "we just did this because
it worked". Encoding often doesn't get any mention.
I will have to run some more big data tests. My pos class is about
1:1. The bias becomes 100% towards the negative class. I don't have
able to pull a large enough set. I now understand exactly what you
before.
of my attributes with has about 400+ categories. I've been
saving a key to translate future bins. PCA makes no sense for a 1ofC
scenario. If I can come up with or find a better technique for 1ofC
compression, a big data test may be fruitful.
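One way "saving a key to translate future bins" might look for a wide 1ofC attribute is sketched below. It assumes rare categories are pooled into one shared bin and that the saved key is reused on later data. The function names, the `min_count` threshold and the choice of slot 0 as the shared bin are hypothetical, not taken from the thread.

```python
from collections import Counter

def build_bin_key(values, min_count):
    """Give every category seen at least min_count times its own slot.

    Slot 0 is reserved for rare and unseen categories (an assumption of
    this sketch, not something stated in the thread).
    """
    counts = Counter(values)
    frequent = sorted(c for c, n in counts.items() if n >= min_count)
    return {cat: i + 1 for i, cat in enumerate(frequent)}

def encode_1ofc(value, key):
    """Return a 1-of-C vector of length len(key) + 1 for one value."""
    vec = [0] * (len(key) + 1)
    vec[key.get(value, 0)] = 1
    return vec

train = ["a", "a", "a", "b", "b", "c", "d"]
key = build_bin_key(train, min_count=2)  # only "a" and "b" are frequent enough
print(encode_1ofc("a", key))    # frequent category keeps its own slot
print(encode_1ofc("zzz", key))  # unseen category falls into the shared bin
```

The key is an ordinary dict, so it can be stored next to the trained network and applied unchanged to future data.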
intensive.
Post by Josh Menke
Hi Conor,
I have some experience in these areas, but not a huge amount.
I haven't had a whole lot of luck doing the 1:1 approach and using
cost/prior matrices. Especially in those 99.99% situations. I found just
adding a ton more data worked a lot better than balanced + multipliers
afterwards.
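For reference, the "balanced + multipliers afterwards" idea is commonly done as a prior correction on the network output. The sketch below shows that standard reweighting, assuming the network outputs a positive-class probability and was trained at a known class ratio. The function name and the example numbers are illustrative only, not something used in this thread.

```python
def correct_for_prior(p_pos, train_prior, true_prior):
    """Rescale a positive-class probability from a model trained at
    train_prior so that it reflects true_prior instead."""
    pos = p_pos * true_prior / train_prior
    neg = (1.0 - p_pos) * (1.0 - true_prior) / (1.0 - train_prior)
    return pos / (pos + neg)

# A confident 0.9 from a 1:1-trained model, mapped back to a 0.01% prior
# (hypothetical numbers): the corrected probability is well under 1%.
p = correct_for_prior(0.9, train_prior=0.5, true_prior=0.0001)
print(p)
```

The correction shrinks even confident outputs a great deal when the true prior is tiny, which may be one reason the post-hoc approach is hard to get right in the 99.99% case.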
I "sort-of" believe that what you really need is "enough" examples for each
class, whether it be 2-class or a large 1-of-C situation.
I have seen good results either from using "a lot" of data or from
"balancing" the data in large, sparse 1-of-C situations.
But I have also seen 1:1 approaches with 2-classes and cost-matrices fail
miserably compared to just using a lot more data.
This tells me that if you want to use something other than the true
distribution, it's not a trivial thing to find which distribution you should
use.
So in short, the simplest approach is to have a lot of data, enough that you
represent even the less frequent classes. This way you can maintain the
prior in the data, and still learn each class. I think this works best and
is only time and space intensive.
Failing that, it's trial and error, which is "resource" intensive where YOU
the scientist are the resource. This is because you have to try and
"engineer" distributions of the classes that will fit your needs. This may
be "balanced" or it may be something between balanced and the true
distribution. But finding that "sweet spot" may be more costly in terms of
YOUR time, than just getting more data or making your training handle more
data that you already have.
I hope I haven't just rambled too much.
--Josh
Post by Conor Robinson
I agree with you that more data is better; however, I was not
aware of the study you mentioned in (2).
For the 2-class problem you mentioned, where one of your classes is
99.99% of the population (as in many problems such as tumor detection),
would you not duplicate your .01% class to 1:1 with your second class,
for example, in your training batch (thus really needing more than 4G
in some cases)? I try to keep my distributions as accurate as possible;
however, I find applying cost matrices and other methods post training
much less effective. No matter how large your data set gets, I don't
see the network becoming more effective unless you're changing your
ratio. As for your second point, I guess increasing the size of your
network would depend on your data; many times I find larger networks
become more prone to overfitting.
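A minimal sketch of the duplication described above, assuming the two classes are held as in-memory lists and the rare class is duplicated by sampling with replacement. The function name, the fixed seed and the toy data are hypothetical.

```python
import random

def oversample_to_balance(majority, minority, seed=0):
    """Duplicate minority rows (sampling with replacement) until both
    classes contribute the same number of rows to the training batch."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    batch = [(x, 0) for x in majority] + [(x, 1) for x in minority + extra]
    rng.shuffle(batch)  # avoid feeding the network one class at a time
    return batch

majority = list(range(1000))  # stand-in for the 99.99% class
minority = [-1, -2]           # stand-in for the rare class
batch = oversample_to_balance(majority, minority)
print(sum(label for _, label in batch), len(batch))
```

The balanced batch is roughly twice the size of the majority class, which is where the extra memory requirement comes from.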
What are your thoughts on encoding very large 1ofC categories for
neural nets? Even with a very large data set, you encounter sparse
areas that impede training. What types of intelligent compression
might be effective for 1ofC?
Thanks for your thoughts.
Conor
Post by Josh Menke
There are a couple of reasons in general for wanting a very large data
set:
1. Like you mentioned, sparse data. But I also mean sparse as in the
target classes may have very few members compared to the whole
population. For example, a concept learning (2-class) problem where one
class represents 99.99% of the population. In this case, if you want to
both have enough data to learn the concept AND automatically infer the
prior distributions correctly, then having a lot of data is an easy way
to go.
2. If you have a very difficult problem, then research has shown that
neural networks have an uncanny ability to continue to improve accuracy
by using more and more data and larger and larger networks. A group out
of ICSI at Berkeley showed this a few years back for large-scale
speaker-independent phoneme recognition. They were using MASSIVE speech
corpora and showing the accuracy kept increasing at a rate worth the
cost.
--Josh
Post by Conor Robinson
If you're running an Intel chip: I used icc and recompiled, and it worked
fine. The real question is, do you need to run a data set that big? If
your inputs are very wide and/or sparse you might run into real
problems, the 'curse of dimensionality'. It all depends on your data
set, but a random sample may be more practical. You could get better
results with x-fold validation. I'm curious as to what kind of data
you're looking at. I don't think recompiling FANN for 64-bit should
cause you any trouble. Good luck.
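As a rough illustration of the x-fold validation suggestion, the sketch below shuffles row indices once and yields train/validation splits. It assumes the (possibly sampled) data set is addressable by row index; the function name and parameters are hypothetical and this is not part of FANN.

```python
import random

def kfold_indices(n_rows, k, seed=0):
    """Shuffle row indices and yield (train, validation) index lists,
    one pair per fold; every row is used for validation exactly once."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal slices
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

splits = list(kfold_indices(n_rows=10, k=5))
for train, valid in splits:
    print(len(train), len(valid))
```

Only the index lists are held in memory here, so the rows themselves can stay on disk and be read per fold.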
Post by Josh Menke
As a quick workaround, I wrote a batch in SAS that split an 8 GB
data
each
enough
all
Post by Mark Knecht
actually.) If you have some code and a large data set then I'd be
happy to try building it and seeing if it runs.
Contact me off-line if that's of interest.
Cheers,
Mark
Post by Poul-Erik Andreasen
Hi
There has earlier been some discussion about datasets larger
than
far
if
_______________________________________________
Fann-general mailing list
https://lists.sourceforge.net/lists/listinfo/fann-general
--
Joshua Menke
Statistician, Machine Learning Scientist
TnS Detection Platforms
ebay, Inc