New Style of Computer Adaptive Testing for Math and Reading SOLs in Spring 2023

by James C. Sherlock

End-of-year SOL results will be released this summer and are much anticipated as a measure of how well kids are recovering from the enforced school shutdowns during COVID.

Some readers know the current testing system like the backs of their hands. But nearly all of us took standardized tests in school in which everyone took the same tests with the same questions.

That is not how math and reading SOL tests are designed now in Virginia.

Computer Adaptive Tests (CAT) – the link provides more detail than I will here – use algorithms to personalize the test for each student. SOLs have been computer adaptive for more than a decade.

How a student responds to a question determines the difficulty of the next item. A correct response leads to a more difficult item, while an incorrect response results in the selection of a less difficult item for the student.

CATs measure each student’s level of knowledge more completely by not continuing with test questions that are either too easy or too hard for that particular student.
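
For readers who want to see the mechanics, here is a minimal sketch of the selection loop in Python. The item bank, difficulty levels, and fixed test length are invented placeholders, and the simple one-step up/down rule stands in for the statistical (item response theory) models production CATs actually use.

```python
# Minimal sketch of adaptive item selection.  The item bank and the
# one-step up/down rule are illustrative placeholders; production CATs
# typically choose items using item response theory (IRT) models.

def run_cat(item_bank, answer_fn, num_items=10, start_level=3):
    """Administer num_items questions, adapting difficulty to responses.

    item_bank   -- dict mapping a difficulty level (int) to a list of items
    answer_fn   -- callable(item) -> True if the student answers correctly
    start_level -- difficulty of the first item
    """
    level = start_level
    responses = []
    for _ in range(num_items):
        # Stay within the difficulty levels the bank actually contains.
        level = max(min(level, max(item_bank)), min(item_bank))
        item = item_bank[level].pop()   # draw an unused item at this level
        correct = answer_fn(item)
        responses.append((level, correct))
        # Correct answer -> harder next item; incorrect -> easier.
        level = level + 1 if correct else level - 1
    return responses
```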

Important changes were made for the tests taken in Spring of 2023.

Based upon legislation passed in 2021, the Spring 2023 tests administered questions on grade level, one grade level up, or one grade level down, depending upon the student’s progress through the earlier parts of the exam.

That seems an improvement, offering a more thorough measure of student learning and potentially a more engaging test for each student.

The math and reading blueprints for 2023 SOLs contain information for two types of tests, the online computer adaptive test (CAT) and the traditional test.

A passage-based CAT is a customized assessment where each student receives a unique set of passages and items.

This is in contrast to the traditional test in which all students who take a particular version (paper, large print, or braille) of the test receive the same passages and respond to the same test questions.

All online versions of the Reading and Math Standards of Learning (SOL) test (including audio) are computer-adaptive. Paper tests are given only by exception under specific criteria.

As an example of the change, see the Grade 4 math test blueprint, new for 2023:

Beginning in spring 2023, the computer adaptive Standards of Learning tests will include a section of additional items at the end of the test.

The computer algorithm may deliver items one grade level above or one grade level below a student’s current grade based upon the student’s responses to the on-grade-level items.

The Test Scaled Score (0 to 600) and corresponding performance level (i.e., pass/proficient, pass/advanced, fail/basic, fail/below basic) are based upon a student’s performance on the on-grade-level Operational Items only.

The student’s responses to the on-grade-level Operational Items and the Additional Items that may be on grade level, one grade level above, or one grade level below the current grade level will be reflected in the student’s Vertical Scaled Score.
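
To make the two-score distinction concrete, here is a rough sketch in Python of which responses feed which score. The item roles follow the blueprint language (operational, additional, and field-test items), but the percent-correct arithmetic is purely a placeholder; VDOE derives both scaled scores from statistical scaling models.

```python
# Illustration of which item sets feed which score.  The mapping from
# percent correct to a 0-600 scale is a placeholder, not VDOE's method.

def summarize_scores(responses):
    """responses: list of (role, grade_offset, correct) tuples, where
    role is 'operational', 'additional', or 'field_test' and
    grade_offset is -1, 0, or +1 relative to the student's grade."""
    # Test Scaled Score (0-600): on-grade-level Operational Items only.
    on_grade_ops = [correct for role, offset, correct in responses
                    if role == "operational" and offset == 0]
    test_scaled = round(600 * sum(on_grade_ops) / len(on_grade_ops))

    # Vertical Scaled Score: Operational Items plus the Additional Items,
    # which may be on grade level, one grade above, or one grade below.
    vertical_items = [correct for role, offset, correct in responses
                      if role in ("operational", "additional")]
    vertical_fraction = sum(vertical_items) / len(vertical_items)

    # Field-test items evaluate new questions, not students, and count
    # toward neither score.
    return test_scaled, vertical_fraction
```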

There will inevitably be some kinks the first time this is done at this scale in Virginia. They will be worked out over time. CAT lends itself to matrix sampling, a significant advantage.

It will be interesting to see both the raw pass/fail rates and the assessments of such things as:

  • any net effects on scores compared to past tests, at both the group and individual levels; and
  • any substantive differences between CAT and paper tests.

But it seems right.


Comments

38 responses to “New Style of Computer Adaptive Testing for Math and Reading SOLs in Spring 2023”

  1. James Kiser

    So a student who does poorly will continue to get test questions below his or her supposed grade level. That indicates to me that the person should never have been advanced to begin with. But we can’t hold anyone back because of all that systemic racism, I suppose.

    1. Nathan

      That’s not necessarily correct. Adaptive tests are just a more accurate way of testing.

      Let’s imagine we are testing physical ability. Someone is asked to bench press 100 pounds. They fail this test and get a zero.

      This person’s ability isn’t zero, but it’s less than 100, so you try 90 next time. If the person is successful, it might go up to 95 next time. If they fail again, it goes down, etc.

      Another person takes the bench press test at 100 pounds and passes. This person can do 100 pounds, but we don’t know how much more, so the next test is for 110 pounds.

      The goal is to accurately measure someone’s ability. Adaptive questions are designed to do that better than a one-size-fits-all test.
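
      In code, the same staircase idea might look like this rough sketch (the weights, step size, and number of trials are all made up):

      ```python
      # Rough sketch of the bench-press analogy: a staircase search that
      # homes in on the most weight someone can lift.  All numbers invented.

      def estimate_max_lift(can_lift, start=100, step=10, trials=8):
          """can_lift: callable(weight) -> True if the lift succeeds."""
          weight = start
          for _ in range(trials):
              if can_lift(weight):
                  weight += step          # success: try something heavier
              else:
                  weight -= step          # failure: back off
              step = max(step // 2, 5)    # halve the step (floor of 5)
          return weight

      # Example: a lifter whose true max is 93 pounds ends up near 93.
      print(estimate_max_lift(lambda w: w <= 93))
      ```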

      1. James C. Sherlock

        As I wrote, I studied it enough that I believe in the science.

        See https://en.wikipedia.org/wiki/Computerized_adaptive_testing for an excellent description.

        1. Nathan

          I too am a believer, but as with anything, that doesn’t mean a given implementation will live up to the potential, especially right out of the gate.

          There may be a few bumps in the road early on, but it should prove to be much better in the long run.

    2. James C. Sherlock

      Actually, if a student gets a couple of the questions below grade level correct, a properly designed CAT will jump him or her back up to the grade-level question bank for the next one. The same holds at all levels of performance: a student answering above-grade questions will jump back down in response to failure.

      That doesn’t get to the rest of your point, but I wanted to clarify.

      1. Lefty665

        With 35 on-grade questions and a total of 6 potential off-grade questions, at most 1 in 6 answers can generate a subsequent off-grade question.

        Like you, I like the concept, but the implementation seems pretty limited. From the post, it is not used in tabulating SOL scores and ratings of fail or advanced. Curious.

        1. Nathan

          “…at most only 1 in 6 answers can generate a subsequent off grade question.”

          Not necessarily. A profile may be created from all 35 questions. That generates the appropriate 6.

          1. Lefty665

            That is not what the post and VDoE video say. They both state that individual test answers determine the next question asked, including up to a total of 6 off-grade questions, and that those questions are in line on the test, not bunched at the end.

            There are also indications that subsequent on-grade questions are chosen based on test answers. It appears to be an iterative process. That would make a lot of sense but again raises questions about comparability to fixed-question tests.

            Not that it makes a difference, but that also appears to be the way the Microsoft test you took was constructed.

          2. Nathan

            Thinking about it, that makes sense. If the important ones were all at the beginning, and others at the end, students would blow them off.

  2. James Wyatt Whitehead

    This video illustrates the CAT system for those of us who need to visualize the latest iteration of the SOL test.
    https://www.youtube.com/watch?v=e9XP23hhVmw

    1. Lefty665

      The video shows a student’s score as reflecting answers to all the questions asked, including both above and below grade level yet the post says:

      “The Test Scaled Score (0 to 600) and corresponding performance level (i.e., pass/proficient, pass/advanced, fail/basic, fail/below basic) are based upon a student’s performance on the on-grade-level Operational Items only.”

      I’ve still got some cognitive dissonance.

      1. Nathan

        There are two different scores.

        Test Scaled Score (This is comparable to previous tests)

        Vertical Scaled Score (This is new and has a different purpose)

        https://uploads.disquscdn.com/images/5e972c1a7cad57c6ac9a9ebdfd44f629acb1a0c5ec049e333258f15341cc7a88.jpg

        1. Lefty665

          The potential for variance between the scores is what leaves me with cognitive dissonance.

  3. Nathan

    I took my first adaptive test about 25 years ago. It was for a technology certification from Microsoft.

    As I proceeded through the test, I found it MUCH more difficult than previous Microsoft certification exams. Initially, I was a bit depressed, thinking that I might fail.

    Well, turns out I didn’t fail at all, but actually did quite well on the adaptive test. I’m guessing that I answered the first questions correctly, so it kept giving me progressively harder and harder questions as I went along. The test was designed to probe the depth of my understanding.

  4. Lefty665

    How will the cut scores be tabulated and adjusted? Seems like a potential briar patch.

    If 35 questions are as valid as 50 in determining SOL why are we using 50? If they are not, why degrade the validity of the SOL testing?

    Note *** says the GA required “additional” test items below and above grade level. Reducing the test questions from 50 to 35 then potentially adding 6 off grade questions is a curious definition of “additional”.

    For example, a sub-grade question: Your pay is now $50. I’m going to reduce your pay to $35 then give you a $6 increase. Your pay has gone up 17%, how happy are you with the additional money?

    I like the idea of testing adapting to testees’ ability, but I have some questions about how it is implemented and what impact it has on both scores and comparability with past SOL tests.

    1. James Wyatt Whitehead

      Must we all be reduced by algorithms? I noticed the practice tests that are available do not include the CAT technology. This should be corrected. Teachers need to know how this works and demonstrate the CAT technology to students. Otherwise, it’s another “gotcha game”. I link to the technology-enhanced SOL tests. All subjects and grade levels included.
      https://www.doe.virginia.gov/teaching-learning-assessment/student-assessment/sol-practice-items-all-subjects

      1. James C. Sherlock

        DOE says it has been giving CAT tests for more than a decade. I took them at their word. Did you not encounter them?

        Students and teachers certainly need to be exposed to CAT on a regular basis, not just on SOLs.

        Nonetheless, having investigated the technology in some depth, both the original methodology and the new above/at/below level questions added at the end of the test make sense to me scientifically.

        All students reportedly get a set of “core” questions, though not the same ones, to establish a baseline before the questions diverge. I like the concept.

        But you were the recent teacher, not me. What issues do you see?

        1. James Wyatt Whitehead

          If CAT was used on the 11th-grade US and VA History test, I never knew about it. It probably was used on elementary tests as experimental field-test questions. This is often done to see how innovations work before counting those questions in the computing of scores. This could be problematic for SPED and ELL students. I do see your point about the science of CAT testing and its potential usefulness. If this is the new way, CAT testing should be done periodically throughout the year by the classroom teacher. The data would be far more useful during the year than at the end of the year. This practice could lead to interventions that produce mastery of content.

          1. Nathan

            “The data would be far more useful during the year than at the end of the year. This practice could lead to interventions that produce mastery of content.”

            Agreed. Once you have finished the year, it’s too late for that instructor to make changes for those students.

            I can see where the Vertical Scaled Score might be helpful for administrators, however.

            Why vertical scaling?

            A vertical scale is incredibly important, as it enables inferences about student progress from one moment to another, e.g. from elementary to high school grades, and can be considered a developmental continuum of student academic achievements. In other words, students move along that continuum as they develop new abilities, and their scale score alters as a result (Briggs, 2010).

            This is not only important for individual students, because we can track learning and assign appropriate interventions or enrichments, but also in an aggregate sense. Which schools are growing more than others? Are certain teachers better? Perhaps there is a noted difference between instructional methods or curricula? Here, we are coming up to the fundamental purpose of assessment; just like it is necessary to have a bathroom scale to track your weight in a fitness regime, if a government implements a new Math instructional method, how does it know that students are learning more effectively?

            https://assess.com/vertical-scaling/

          2. James C. Sherlock

            Thanks for your assistance in this give and take.

      2. Lefty665

        FYI, The link takes me to a page that asks for user name and password to continue.

        1. James Wyatt Whitehead

          Fixed it. Try this one. Click the giant yellow bar. It will take you to all the tests.
          https://www.doe.virginia.gov/teaching-learning-assessment/student-assessment/sol-practice-items-all-subjects

          1. Lefty665

            Tku:)

    2. Nathan

      “Your pay is now $50. I’m going to reduce your pay to $35 then give you a $6 increase. Your pay has gone up 17%, how happy are you with the additional money?”

      Your example is not applicable at all.

      There are 35 operational questions. Only those questions determine the 0 – 600 Test Scaled Score, which is a point in time score comparable to previous SOL tests.

      The 5 field test questions are there just to test the questions for potential future use, not the students.

      The 6 on or off grade level questions appear to be designed for the Vertical Scaled Score. This score is designed to measure a student’s progress over time.

      “Vertical Scaled Score Interpretation for Growth Assessments”

      https://www.k12albemarle.org/our-departments/accountability/assessment/vertical-scaled-score-interpretation-for-fall-2022-growth-assessments

      1. James Wyatt Whitehead

        Educational Statistics was my least favorite class in college.

      2. Lefty665

        So a student’s Vertical Scaled Score can vary from the Test Scaled Score? A student may rank pass/proficient on the Test Scaled Score and pass/advanced or fail/below basic on the Vertical Scaled Score? Seems something of a mixed message.

        1. Nathan

          They are measuring different things for different purposes.

          The Test Scaled Score is the one we are accustomed to.

          1. Lefty665

            I hear you, but that’s not what the VDoE video JWW linked to shows (I suspect the video is wrong).

            There is also a question of how well results from a 35-question adaptive test correlate with a 50-question fixed test. The answer almost has to be that it varies.

          2. Nathan

            “There is also a question of how well results from a 35 variable question test correlates to a 50 fixed question test. The answer almost has to be that it varies.”

            We will see. If so, it’s a temporary cost of converting to a better system. IMHO

          3. Lefty665

            How do you validate that it is “better”? Does a vertical scaled score really tell you much more than a “pass/advanced” or “fail/below” and a numeric score that tells you how far above or below standard you are?

            I like the idea of automated testing. It has several advantages. It is for sure a process improvement, but I’m not convinced it gives results that are profoundly better than what Virginia has been using.

            Does it break the ability to compare to past testing? There certainly have been questions here on BR about changes in the tests and cut scores that impair historical comparisons.

            You didn’t get a CNA certificate with a gold star, did you?

          4. Nathan

            The last part of my comment was “IMHO.”

            That’s In My Humble Opinion.

            You are welcome to disagree, and I see no point in debating the issue at this stage. Better to re-engage after we see the scores using the new system.

        2. James Wyatt Whitehead

          I can see that happening. A byproduct of the SOL test, which is only a minimum standard. In theory, a CAT test could challenge the advanced student to see just how sophisticated their understanding of content really is.

  5. Fred Costello

    Much depends on how the CAT test results are reported and used. Will a 4th-grade student who gets a few 4th-grade questions wrong be given a score such as 3.8 if he gets the 3rd-grade questions right? The Internet lists non-quantitative advantages and disadvantages. I can foresee much confusion among those who interpret the test results — especially the parents.

  6. Monica Wright

    This is a positive change. The SAT is also moving to adaptive tests. While homeschooling in VA, I used adaptive tests every year, which allowed a seamless transition back to public schools. They’re proven and respected, allow top students to demonstrate their ceiling, and allow all students to leave the assessment in a reasonable period of time with confidence intact. Enough with the ‘back in my day’ nonsense. These tests are 20+ years old.

  7. Matt Hurt

    The on/off grade level questions which were added to the tests this year comprise only five or six questions. It is nearly impossible to determine student understanding with that few items. These items are not used to determine the test scaled score (the “SOL” score of 0-600 that determines whether or not the student passed the test), but are used to determine the vertical scaled score (the score used to determine growth).

    While this idea might look good on paper, it provides no meaningful information to parents or teachers. And while this requirement does not yield any positive results, it seems not to have caused any negative unintended consequences, unlike the much-maligned through-year “growth” assessments.
