Have a look at Kraken, which has many state-of-the-art models for both HTR and OCR.
They are: their car is just a dog in the actual race. From a pure qualifying pace POV they are a lot better, with Hulkenberg being able to get the car into Q3 quite consistently.
That's also what makes them seem better than they really are: Hulk qualifies in P8 (great) and then tumbles down to P16 by the end of the race (usually because they have to stop more often, or at least go to worse tires, since their tire deg is abysmal).
No he doesn't?
Don't get me wrong, there are many places where the paper can be wrong (e.g. fig 1, or their magnetism exceptionally looking more similar to diamagnetism than superconductivity), but you are mixing him up with Ranga Dias, who has had a history of data fabrication.
Dias has nothing to do with this paper though.
Not really: you have to keep in mind the amount of expertise and resources that already went into silicon, as well as the geopolitics and sheer availability of silicon. The closest currently available competitor is probably gallium arsenide. That has a couple of disadvantages compared to silicon:
- It's more expensive (both due to economies of scale and the fact that silicon is just much more abundant in general)
- GaAs crystals are less stable, leading to smaller boules.
- GaAs is a worse thermal conductor
- GaAs has no native "oxide" (compare to SiO₂) which can be directly used as an insulator
- GaAs hole mobility is worse (roughly 500 cm²/(V·s) for Si vs 400 for GaAs), which means P-channel FETs are naturally slower in GaAs, which makes CMOS structures impractical
- GaAs is a compound rather than a pure element, which means you get into trouble with mixing the elements
You usually see GaAs combined with germanium substrates for solar panels, but rarely independently of that (GaAs is simply bad for logic circuits).
In short: It's not really useful for logic gates.
Germanium itself is another potential candidate, especially since it can be alloyed with silicon which makes it interesting from an integration point-of-view.
SiGe is very interesting from a logic POV considering its high forward and low reverse gain, which makes it interesting for low-current high-frequency applications. You also naturally get heterojunctions, which allow you to tune the band gap (on the other hand you get the same problem as in GaAs: it's not a pure element, so you need to tune the band gap).
One problem specifically for MOSFETs is the fact that you don't get stable silicon-germanium oxides, which means you can't use the established silicon-on-insulator techniques.
Cost is also a limiting factor: before even starting to grow crystals you have the pure material cost, which is roughly $10/kg for silicon and $800/kg for germanium.
That's why, despite the fact that the early semiconductors all relied on germanium, germanium-based systems never really became practical: it's harder to do mass production, and even if you can start mass production it will be very expensive (that's why, if you do see germanium-based tech, it's usually in low-production runs for high-cost specialised components).
There's some research going on in commercialising these techniques but that's still years away.
Zeiss is German, they also produce substantially more than just the optics https://en.m.wikipedia.org/wiki/Carl_Zeiss_SMT
I mean it's very easy to side with SAG-AFTRA and the WGA considering he literally has no other option (all major studios are union).
The "adequate covering" of our distribution p is also pretty self-explanatory: We don't need to see the statement "elephants are big" a thousand times to learn it, but we do need to see it at least once:
Think of the p distribution as e.g. defining a function on the real numbers. We want to learn that function using a finite amount of samples. It now makes sense to place our samples at interesting points (e.g. where the function changes direction), rather than just randomly throwing billions of points against the problem.
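To make the "interesting points" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the function is f(x) = |x|, the "model" is plain piecewise-linear interpolation, and the helper names `interpolate` and `max_error` are made up. It compares three samples that include the kink at 0 against three samples that miss it.

```python
def interpolate(xs, ys, x):
    """Piecewise-linear interpolation through the points (xs[i], ys[i]); xs must be sorted."""
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    raise ValueError("x outside the sampled range")

def max_error(f, sample_points, grid):
    """Worst-case gap between f and the interpolation learned from sample_points."""
    ys = [f(x) for x in sample_points]
    return max(abs(f(x) - interpolate(sample_points, ys, x)) for x in grid)

f = abs                                   # toy "true function" with a kink at 0
grid = [i / 100 - 1 for i in range(201)]  # evaluation points from -1 to 1

informed = [-1.0, 0.0, 1.0]    # one sample sits exactly where f changes direction
uninformed = [-1.0, 0.5, 1.0]  # same budget, but the kink is missed

error_informed = max_error(f, informed, grid)
error_uninformed = max_error(f, uninformed, grid)
print(error_informed, error_uninformed)
```

With the same budget of three samples, the set that covers the kink reproduces the function (up to float rounding), while the other one is off by a large margin around 0.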
That means that even if our estimator is bad (i.e. it can barely distinguish real and fake data), it is still better than just randomly sampling (e.g. you can say "let's generate 100 samples of law, 100 samples of math, 100 samples of XYZ,..." rather than just having a big mush where you hope that everything appears).
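A rough sketch of the "100 samples of law, 100 samples of math" idea, under loud assumptions: `stratified_sample` and `predict_topic` are hypothetical names, the pool is a toy list of (topic, id) pairs, and the estimator here simply reads the stored label (a real one would be a noisy classifier).

```python
import random
from collections import defaultdict

def stratified_sample(items, predict_topic, per_topic, rng):
    """Pick up to `per_topic` items from every topic the estimator predicts."""
    buckets = defaultdict(list)
    for item in items:
        buckets[predict_topic(item)].append(item)
    selected = []
    for topic in sorted(buckets):
        pool = buckets[topic]
        selected.extend(rng.sample(pool, min(per_topic, len(pool))))
    return selected

# Toy pool, heavily skewed towards generic web text.
pool = (
    [("web", i) for i in range(900)]
    + [("law", i) for i in range(90)]
    + [("math", i) for i in range(10)]
)

# Hypothetical estimator: it just reads the stored label.
def predict_topic(item):
    return item[0]

rng = random.Random(0)
selected = stratified_sample(pool, predict_topic, per_topic=10, rng=rng)
uniform = rng.sample(pool, 30)  # for comparison: a plain random draw of the same size

print(sorted({topic for topic, _ in selected}))
print(sum(1 for topic, _ in uniform if topic == "web"), "of 30 uniform draws are web text")
```

The stratified draw always contains every topic the estimator found, while the plain random draw mostly mirrors the skew of the pool.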
That makes a few assumptions: the estimator is better than 0% accurate, the estimator has no statistical bias (e.g. the estimator didn't learn things like "add all sentences that start with an A", since that would shift our distribution), and some other things that are too intricate to explain here.
Importantly: even if your estimator is bad, it is better than not having it. You can also manually tune it towards being a little bit biased, either to reduce variance (e.g. let's filter out all HTML code), or to reduce the impact of certain real-world effects (like that most stuff on the internet is English: you may want to balance that down to get a more multilingual model).
However, you have to note here that these are LANGUAGE MODELS. They are not everything models.
These models don't aim for factual accuracy, nor do they have any way of verifying it: That's simply not the purview of these systems.
People use them as everything models, because empirically there's a lot more true stuff than nonsense in those scrapes and language models have to know something about the world to e.g. solve ambiguity, but these are side-effects of the model's training as a language model.
If you have a model that produces completely realistic (but semantically wrong) language, that's still good data for a language model.
"Good data" for a language model does not have to be "true data", since these models don't care about truth: that's not their objective!
They just complete sentences by predicting the next token, which is independent of factuality.
There are people working on making these models more factual (same idea: you bias your estimator towards things that are more likely to be true, like boosting reliable sources such as Wikipedia, rather than training on uniformly weighted webscrapes), but to do that you need a lot more overview over your data, for which you need more efficient models, for which you need better distributions, for which you need better estimators (though in that case they would be "factuality estimators").
In general though the same "better than nothing" sentiment applies: if you have a sampling strategy that is not completely wrong, you can still beat completely random sample models. If your estimator is good, you can substantially beat them (and LLMs are pretty good in almost everything, which means you will get pretty good samples if you just sample according to the probability that the LLM tells you "this data is good").
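"Sample according to the probability that the estimator says the data is good" could look roughly like this. It is only a sketch: `sample_by_quality` is a made-up helper, the two documents and their scores are invented, and in practice the scores would come from some model rather than a dict. Note that `random.choices` draws with replacement.

```python
import random

def sample_by_quality(documents, quality, k, rng):
    """Draw k documents (with replacement), weighted by their estimated quality."""
    weights = [quality(doc) for doc in documents]
    return rng.choices(documents, weights=weights, k=k)

# Invented quality scores standing in for "probability that this data is good".
scores = {"well written article": 0.9, "keyword spam": 0.1}
documents = list(scores)

rng = random.Random(0)
draws = sample_by_quality(documents, scores.get, k=10_000, rng=rng)
share_good = draws.count("well written article") / len(draws)
print(share_good)
```

Both documents appear once in the pool, but the resulting training mix leans heavily towards the one the estimator likes. A uniform draw would have given each about half.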
For actually making sure that the stuff these models produce is true, you need very different systems that actually model facts, rather than just modelling language. Another way is to remove the bottleneck of machine learning models with respect to accuracy (i.e. you build a model that may be bad, but can never give you a wrong answer):
One example would be vector-search engines that, like search engines, retrieve information from a corpus based on the similarity as predicted by a machine learning model. Since you retrieve from a fixed corpus (like Wikipedia) the model will never give you wrong information (assuming the corpus is not wrong)! A bad model may just fail to find the correct entry (e.g. the right Wikipedia article) to present to you.
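A toy version of such a retrieval setup, with heavy simplifications: the `embed` function is a bag-of-words count vector standing in for a learned embedding model, the three-sentence corpus is made up, and `retrieve` is a hypothetical helper. The point is only that the answer is always a verbatim entry of the corpus.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in for a learned embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, corpus):
    """Return the corpus entry most similar to the query, never generated text."""
    q = embed(query)
    return max(corpus, key=lambda doc: cosine(q, embed(doc)))

corpus = [
    "Elephants are the largest living land animals",
    "Silicon is a chemical element with atomic number 14",
    "Markdown is a lightweight markup language",
]

answer = retrieve("which element is silicon", corpus)
print(answer)
```

A worse `embed` would just make `retrieve` pick a less relevant entry; it still cannot return a sentence that is not in the corpus.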
Yes: keep in mind that with "good" nobody is talking about the content of the data, but rather how statistically interesting it is for the model.
Really what machine learning is doing is trying to deduce a probability distribution q from a sampled distribution x ~ p(x).
The problem with statistical learning is that we only ever see an infinitesimally small amount of the true distribution (we only have finite samples from an infinite sample space of images/language/etc.).
So now what we really need to do is pick samples that adequately cover the entire distribution, without being redundant, since redundancy produces both more work (you simply have more things to fit against), and can obscure the true distribution:
Let's say that we have a uniform probability distribution over [1,2,3] (uniform means everything has the same probability of 1/3).
If we faithfully sample from this we can learn a distribution that will also return [1,2,3] with equal probability.
But let's say we have some redundancy in there (either direct duplicates, or, in the case of language, close-to duplicates):
The empirical distribution may look like {1,1,1,2,2,3} which seems to make ones a lot more likely than they are.
One way to deal with this is to just sample a lot more points: if we sample 6000 points, we are naturally going to get closer to the true distribution (similar to how flipping a coin twice can give you 100% tails probability, even if the coin is actually fair: once you flip it more often, it will return to the true probability).
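You can check the "more samples get closer" claim with a quick simulation. This is a sketch with made-up helper names (`empirical_distribution`, `max_deviation_from_uniform`) and a fixed seed, comparing the skewed six-point sample from above with 6000 faithful draws from the uniform distribution over [1,2,3].

```python
import random
from collections import Counter

def empirical_distribution(samples):
    """Relative frequency of every value that appears in the samples."""
    counts = Counter(samples)
    return {value: counts[value] / len(samples) for value in sorted(counts)}

def max_deviation_from_uniform(samples, support=(1, 2, 3)):
    """Largest gap between an observed frequency and the true probability 1/3."""
    target = 1 / len(support)
    dist = empirical_distribution(samples)
    return max(abs(dist.get(v, 0.0) - target) for v in support)

rng = random.Random(0)
large = [rng.choice((1, 2, 3)) for _ in range(6000)]

skewed_deviation = max_deviation_from_uniform([1, 1, 1, 2, 2, 3])
large_deviation = max_deviation_from_uniform(large)
print(skewed_deviation, large_deviation)
```

The six-point sample is off by 1/6 for the value 1, while the 6000-point sample stays within a couple of percentage points of 1/3 for every value.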
Another way is to correct our observations towards what we already know to be true in our distribution (e.g. a direct 1:1 duplicate in language is presumably a copy-paste rather than a true increase in probability for a subsequence).
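The simplest form of that correction is plain deduplication. A minimal sketch, assuming that two texts count as "the same" if they match after lowercasing and collapsing whitespace (real pipelines use fuzzier near-duplicate detection); `normalize` and `deduplicate` are hypothetical helpers and the corpus is invented to mirror the {1,1,1,2,2,3} example.

```python
def normalize(text):
    # Assumption: case and whitespace differences do not make a sample "new".
    return " ".join(text.lower().split())

def deduplicate(samples):
    """Keep the first occurrence of every (normalized) sample, in order."""
    seen = set()
    unique = []
    for sample in samples:
        key = normalize(sample)
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique

corpus = [
    "Elephants are big",
    "elephants  are big",   # copy-paste with different spacing/case
    "Elephants are big",    # exact duplicate
    "Mice are small",
    "mice are small",
    "Cats are agile",
]

unique = deduplicate(corpus)
print(unique)
```

After deduplication each statement appears once, so the copy-pasted ones no longer look more probable than they are.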
<continued in next comment>
At this point it's on you if you believe this grifter.
It really depends on what you want: I really like Obsidian, which is cross-platform and uses basically vanilla markdown, which makes it easy to switch should this project go down in flames (there are also plugins that add additional syntax which may not be portable, but that's as expected).
There's also Logseq, which has much more bespoke syntax (major extensions to markdown), but is also OSS, meaning there's no real danger of it suddenly vanishing from one day to the next.
Specifically, Logseq is much heavier than Obsidian, both in the app itself and the features it adds to markdown, while Obsidian is much more "markdown++", with a significant part of the "++" coming from plugins.
In my experience Logseq is really nice for short-term note taking (e.g. lists, reminders, etc.) and Obsidian is much nicer for long-term notes.
Some people also like Notion, but I never got into that: it requires much more structure ahead of time and is very locked down (it also obviously isn't self-hosted). I can see Notion being really nice for people that want less general note-taking and more custom "forms" to fill out (e.g. traveling checklists, production planning, etc.).
Personally, I would always go with Obsidian, just for the peace of mind that the markdown plays well with other markdown editors, which is important for me if I want a long-running knowledge base.
Unfortunately I cannot tell you anything with regards to collaboration, since I do not use that feature in any note-taking system.
I can just go to the search tab and look for the magazine (e.g. search for retro gaming) and find all the other instances.
I think a fair number of people forget to switch the search to magazines before looking (or are actually subscribing to other instances but don't notice it).
I think that interest is added: specifically, you pay the rates listed at https://www.irs.gov/payments/quarterly-interest-rates based on the quarters you were missing taxes.
Not sure if the numbers quoted here include that already.